About the Product

Agara’s uniqueness comes from:
(a) the quality of its conversations, and
(b) the ease of creating new workflows for specific industries.

 

Quality of Conversations

Agara’s autonomous agents are built to carry out 10x deeper conversations than typical chatbot platforms while holding context. The conversation system is trained on real historical support conversations and can handle any variation the consumer can come up with. Consumers can change their mind mid-call, refuse to provide information, ask questions, or provide information without being prompted (things that happen in real conversations), and Agara will handle all of this seamlessly, gracefully transferring to a human representative in the rare cases where the consumer’s question is beyond Agara’s ability. Agara is also built to learn from the conversations it carries out, getting better over time.

 

Ease of Creating Workflows

Configuring Agara for a specific brand or use case is much like onboarding an experienced customer support agent onto a new brand. You only need to provide Agara with the high-level, happy-path flow, and Agara will automatically add all the unhappy paths and variations that are possible. Providing a happy path is done by simply selecting from an array of pre-built conversation blocks. However, Agara also allows finer-grained control of the agent’s language for those who need it.
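To make this concrete, here is a hypothetical sketch of what assembling a happy path from pre-built blocks could look like. The block names and the ConversationBlock structure are invented for illustration and are not Agara’s actual interface.

```python
# Hypothetical sketch of specifying a happy-path flow from pre-built blocks.
# Block names and the ConversationBlock structure are assumptions for
# illustration, not Agara's actual interface.
from dataclasses import dataclass, field

@dataclass
class ConversationBlock:
    name: str                          # a pre-built block, e.g. "authenticate"
    params: dict = field(default_factory=dict)

# The brand specifies only the happy path; unhappy paths and variations
# would be added automatically by the engine.
happy_path = [
    ConversationBlock("greet"),
    ConversationBlock("authenticate", {"entity": "account_number"}),
    ConversationBlock("capture_intent", {"expected": ["check_balance", "block_card"]}),
    ConversationBlock("confirm_and_close"),
]

for block in happy_path:
    print(block.name, block.params)
```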

Conversational AI is a common term for messaging apps, chatbots, and voice-enabled assistants or bots that can carry out conversations with humans over text or voice. Conversational AI empowers businesses to engage customers in 1:1 interactions via text or voice at scale. Interactions may be based on context and past behavior for personalized experiences.

Agara specializes in voice conversational AI. Agara builds virtual autonomous customer support agents for specific industries and use cases, for instance, an autonomous agent for account management in banking. Agara’s conversational AI is built on a stack of Deep Learning / Machine Learning models trained on real customer support conversations.

 

Conversations Module

Agara’s conversations module is built specifically for voice conversations and therefore accounts for their complexities. It is capable of holding deeper multi-turn conversations, adapts to changes in the speaker’s context, and, most importantly, converses naturally rather than from scripts.

Its patented response generation module is based on our published state-of-the-art ‘text style transfer’ technique, which ensures the system always comes up with the most contextual and natural responses. The conversational flow module adapts responses to the customer’s speech, switching context and intonation as and when required.

 

Natural Language Understanding (NLU)

Agara’s Natural Language Understanding (NLU) engine is pre-trained with proprietary data collected for specific use cases to accurately identify the intent (what the user wants), entities (name of the product, date/destination of travel, etc.), and the tone & sentiment from a user’s speech. 
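For illustration, the structured result such an engine might produce per utterance could look like the following sketch; the field names are assumptions, not Agara’s actual schema.

```python
# A minimal sketch (field names are assumptions) of the structured output an
# NLU engine like the one described might return for a single utterance.
from dataclasses import dataclass

@dataclass
class NLUResult:
    intent: str        # what the user wants, e.g. "book_flight"
    entities: dict     # e.g. {"destination": "Boston", "date": "next Friday"}
    sentiment: float   # -1.0 (very negative) .. +1.0 (very positive)

result = NLUResult(
    intent="book_flight",
    entities={"destination": "Boston", "date": "next Friday"},
    sentiment=0.2,
)
print(result)
```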

 

Text-to-Speech System (TTS)

Agara’s TTS is built specifically for business-to-customer communication. To ensure it doesn’t sound robotic, it is trained on real calls to mimic the speaking style of an actual customer support agent.

Google Cloud Dialogflow is an end-to-end development suite for building conversational interfaces for websites, mobile applications, popular messaging platforms, and IoT devices.

Dialogflow and Rasa are text-first conversation platforms.

Agara is:

(a) Voice conversation agent: Agara is purpose-built for voice conversations and all the complexities that come with them, such as accents, noise, capturing entities that can have varied spellings, and customers’ expectation of deeper conversations on voice, since they typically call expecting a human agent.

(b) Industry context: Agara is pre-trained with knowledge of specific industries, and no extra data or training is needed for it to understand the common terminology and use cases. Agara also uses its patented Deep Learning models to squeeze out every bit of accuracy possible from its data. This is not the case with Rasa and Dialogflow, where the client needs to provide the data, annotate it, and babysit the process, most likely incurring a large accuracy penalty compared to Agara.

(c) Agara is not a platform: Agara is built for specific industries and use cases and trained on vast amounts of historical conversations. The client only needs to provide a very high-level conversation flow, as they would if they were onboarding a well-experienced agent to a new product. They don’t need to enumerate all the possible ways the conversation could go in reality, provide common-sense information, or specify what happens if the consumer is not following the happy path or cooperating. All of this is taken care of out of the box by Agara’s patented conversation engine, which is purpose-built to handle long conversations in a truly natural way. The user is free to change the context, say something irrelevant, ask questions about the conversation, provide information without prompting, etc. Agara is engineered to behave exactly as a trained agent would at that point. This is something that is not possible by design with platforms like Dialogflow and Rasa.

RPA can do a great job of handling repetitive, rules-based tasks that would previously have required human effort, but it doesn’t learn as it goes like, say, a deep neural network. If something changes in the automated task – a field in a web form moves, for example – the RPA bot typically won’t be able to figure that out on its own.

Still, there’s definitely a relationship between RPA and AI, even if you’re in the camp that thinks RPA does not actually qualify as AI. And that relationship is growing.

AI technologies that augment and mimic human judgment and behavior complement RPA technologies that replicate rules-based human actions.

Converging AI with RPA enables businesses to automate more complex, end-to-end processes than ever before, and integrate predictive modeling and insights into these processes to help humans work smarter and faster.

Agara works purely on voice and does not develop chatbots. While voice bots and chatbots have some similarities, voice brings a host of challenges that are simply not present in chat. These fall under three categories: (a) understanding the user, (b) the conversation, and (c) the voice that speaks.

Understanding the user: 

  • There are 100+ English accents in the world, so a voice bot for global brands needs to be robust across accents even when the content (text) of what callers say is exactly the same
  • Customers can call from a variety of noisy environments
  • Some things are hard even for human agents on voice channels, like getting the customer’s name or identity, because words spelled differently can have very similar pronunciations
  • Most chatbots can be menu-driven, which makes it easy to capture the consumer’s intent and ensure a good user experience, but a menu-driven conversation over speech makes for a poor user experience.

The conversation: 

  1. Capturing entities accurately often takes multi-turn clarification (‘Was that A as in Alpha or E as in Elephant?’), and the conversation needs to account for it
  2. The conversation needs to be deeper to match the expectations of a consumer who called in expecting a human agent. A very chatbot-like conversation can be detrimental to the user experience.

The Voice: 

  1. A robotic-sounding bot is a bad user experience. Human agents undergo quite a bit of training on voice and tone; the bot ideally needs to mimic the same for the best user experience.
  2. Different situations require different tones to be taken. For instance, the tone an agent would take with an irate customer is different from the one they would take with a happy customer. 

Agara doesn’t replace chatbots, as chat and voice are two very different channels. Chatbots are a deflection strategy for dealing with simple, easy-to-answer queries. Despite chatbots and menu-driven apps, a considerable number of customers still end up calling, as seen in the massive call volumes of the industries Agara concentrates on. Agara is designed especially to handle these more complex queries from customers.

If you have significant call volumes, you need Agara. Agara brings a leap in conversational voice technology that can instantly handle complex customer queries, 24×7, and can massively scale to handle the call spikes and volume uncertainties common in the current COVID situation. It does all this while keeping the customer experience at the center: a zero-hold-time experience, consistent and objective messaging, and a truly natural conversation. The significant cost reduction you get from the automation is purely a by-product.

Agara uses publicly available ASRs combined with its proprietary ASR technology. 

Agara uses publicly available ASRs like Google and Amazon to transcribe generic speech. This gives it the ability to transcribe most words spoken by the user and get the base transcript of the conversation.

While these provide reasonably accurate results and are an excellent way to get started on a project, they will not reach the accuracy of a solution optimized for a specific use case, because the machine learning models used by the cloud providers have been trained on generic data rather than domain-specific language. The TTS we use right now is only GCP’s; we are still developing our in-house TTS and entity-specific speech recognition.

In parallel, Agara’s proprietary entity-specific Deep Learning-based ASR models are designed to accurately recognize domain-specific information, phrases, or intents. In a mission-critical setup, accurately capturing the facts the customer provides is critical. For instance, in an insurance call, getting the policy number and the nature of the problem right is central to a great experience.

The entities that qualify for this treatment include:

  • Names
  • Cities, states and countries
  • Numbers (policy number, order number, account number, ticket number)
  • Alphanumeric strings
  • IDs: Passport, driver’s license
  • Dates & days
  • Reason for calling
  • And more…

These independent, entity-specific ASR models are hyper-personalized to particular use-cases and industries for maximum impact. For instance, Agara can accurately capture a string of numbers and characters (such as a PNR number or record locator for airlines), which is a hard task for generic ASRs.

These ASR models have been developed in-house and are built to deliver the highest accuracy for specific data entities. They are trained on voice inputs mimicking how customers typically speak these entities (for instance, ‘Christmas Eve’ instead of ‘24 December’).

Agara controls which of these models are invoked at a given point in time based on what the conversation is about. For instance, when asked the reason for filing a new insurance claim, the customer will most likely provide reasons (‘I had an accident yesterday’ or ‘I already have a claim initiated. I want to add more details’) or ask the bot to speak to an agent (‘I have provided the details. I want to get an update from an agent’). With this context, Agara can invoke the appropriate ASR models to get the best read on the output.
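Purely as an illustration of this context-driven selection (the model registry and names below are invented for the sketch, not Agara’s actual code):

```python
# Illustrative sketch of invoking entity-specific ASR models based on the
# current dialog context. The registry and names are assumptions.
ENTITY_ASR_MODELS = {
    "claim_reason": "asr_reason_for_calling",
    "policy_number": "asr_alphanumeric",
    "travel_date": "asr_dates",
}

def models_for_turn(expected_entities):
    """Pick which entity-specific ASR models to run for this turn,
    alongside the always-on generic ASR."""
    active = ["generic_asr"]  # e.g. a public cloud ASR, always on
    for entity in expected_entities:
        model = ENTITY_ASR_MODELS.get(entity)
        if model:
            active.append(model)
    return active

# After asking for the reason for filing a new claim, the system expects a
# reason and possibly a policy number:
print(models_for_turn(["claim_reason", "policy_number"]))
# ['generic_asr', 'asr_reason_for_calling', 'asr_alphanumeric']
```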

How we use GCP and in-house ASRs together (Additional explanation)

We use accent and language-specific ASRs from GCP to get the transcript of every turn of the customer’s conversation – they’re always ‘on’. — (A)

We also have accent- and language-specific ASRs built in-house that are used to get transcripts of certain turns of the conversation. We know which turns to turn them ‘on’ or ‘off’ for because we have a decent expectation of what the customer is about to say. For instance, if the customer was asked the reason for their call, they would most likely provide a reason from the set of reasons most customers give for that industry and domain (‘I want to cancel my order’, ‘I want to book a new order’, etc.). – (B)

Further, we also have in-house SLU systems that extract intents and entities directly from the customer’s speech. Again, we know when to turn them ‘on’ or ‘off’ because we know what to expect from the customer. For instance, an SLU built specifically to capture strings of numbers and characters (such as a PNR number), which is a hard task for generic ASRs, will provide the PNR directly from the customer’s speech without transcribing the rest of the utterance. – (C)

Given that the goal of the system is to get structured entities and intents from the customer’s speech, an NLU system takes in A, B (when available), and C (when available), and applies proprietary methods to ‘combine/reconcile’ these multiple transcripts and SLU-provided intents and entities to arrive at the final set of intents and entities.

This ‘combining/reconciling’ logic is different for different entities and intents, and is either algorithmically written or can also be learned by a machine-learning model.
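A toy sketch of what such a reconciliation could look like, under the simplifying assumption that each source reports a (value, confidence) pair per entity and the highest-confidence reading wins; the real per-entity logic described above is proprietary and may itself be learned:

```python
# Toy sketch of the 'combine/reconcile' step: readings from the generic
# transcript (A), the in-house domain ASR (B), and the SLU system (C) are
# merged into a final set of entities. The rule shown (highest confidence
# wins) is an assumption for illustration only.
def reconcile(*sources):
    final = {}
    for source in sources:
        for name, (value, confidence) in source.items():
            # keep the highest-confidence reading seen so far for each entity
            if name not in final or confidence > final[name][1]:
                final[name] = (value, confidence)
    return {name: value for name, (value, _) in final.items()}

a = {"pnr": ("AB 123 X", 0.55)}   # from the always-on generic ASR (A)
b = {"pnr": ("AB123X", 0.80)}     # from the in-house domain ASR (B)
c = {"pnr": ("A8123X", 0.91)}     # directly from the SLU system (C)
print(reconcile(a, b, c))         # {'pnr': 'A8123X'}
```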



About Implementation

Agara can be deployed in one of two ways (a hypothetical configuration sketch follows the list):

  • It can be integrated with the client’s existing telephony system through a SIP trunk.
  • Alternatively, the client can be provided with a standalone number managed by Agara.
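For illustration only, a deployment configuration for each mode might look like the following; every key, hostname, and number below is invented for this sketch and is not Agara’s actual settings.

```python
# Hypothetical deployment configurations for the two integration modes.
# All keys, hostnames, and numbers are invented for illustration.
sip_trunk_deployment = {
    "mode": "sip_trunk",
    "sip_host": "sip.contact-center.example.com",  # client's existing telephony
    "sip_port": 5060,
    "transfer_target": "human_agent_queue",        # fallback for escalations
}

standalone_number_deployment = {
    "mode": "standalone_number",
    "number": "+1-800-555-0100",                   # number managed by the vendor
    "transfer_target": "client_support_line",
}
```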

Agara is implemented in production environments in a phased approach. 

Discovery phase: The implementation of Agara in production environments begins with a 2-week discovery phase. During this time, we gather data to understand how the contact center’s calls are being handled currently. This data enables us to recognize the issues at hand and determine suitable workflows.

The subsequent phases are design, development, testing, and launch, which typically take between 4 and 6 weeks. During the design phase, we analyze the existing conversations and establish the baseline metrics needed to design conversations. The development phase consists of data annotation, workflow development, training the speech recognition models, and creating new workflows.

Agara supports integrations as needed with:

  • The client’s existing telephony system
  • The client’s CRM system, for any data Agara needs while handling the call
  • Any other transactional systems that the client might be using

After development, the system is tested for performance, data-gathering capabilities, pushing the data back to the CRM to complete the process, and the other mission-critical tasks associated with accomplishing the goals.

Agara is a configurable agent and can be deployed in minutes to any geography as long as the conversation is in English. Agara supports only English as of today (all major English accents); support for a few other languages is in the works.



About Voice AI

  • Agara uses robust natural language understanding systems that learn to accommodate errors based on context
  • Agara builds conversations around clarifying mistranscribed information while keeping the UX smooth – similar to how a good agent would handle something they have misheard (a small sketch follows this list).
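The second point can be made concrete with a minimal sketch, assuming a hypothetical confidence score from the recognizer and an invented prompt format; this is not Agara’s actual dialog logic.

```python
# Minimal sketch of confidence-driven clarification. The threshold, word
# list, and prompt phrasing are illustrative assumptions.
PHONETIC = {"A": "Alpha", "E": "Elephant"}  # words as used earlier in this FAQ

def clarification_prompt(entity_name, value, confidence, threshold=0.85):
    """If an entity was heard with low confidence, confirm it the way a
    human agent would, instead of silently accepting a possible mishearing."""
    if confidence >= threshold:
        return None                 # confident enough; no clarification needed
    first = value[0].upper()
    word = PHONETIC.get(first, first)
    return f"Just to confirm your {entity_name}: was that {first} as in {word}?"

print(clarification_prompt("booking reference", "E4T2", 0.62))
# Just to confirm your booking reference: was that E as in Elephant?
```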

Noise environments

We have been dealing with noisy environments for large clients, as is typical of customer support calls. There are a few ways we deal with them:

  • Build powerful language models on client-specific data to leverage context and fill in noisy spots. This works for cases where things can be guessed from context.
  • Build specific SLU units for very specific, important entities and intents, trained with noisy data. There are two ways we do this (a noise-mixing sketch follows the list):
    • Data augmentation techniques to artificially add noise to clean speech. This helps but doesn’t go far enough.
    • Train the models on real customer calls where customers are calling from noisy environments.
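As a concrete illustration of the augmentation bullet above, here is a standard noise-mixing sketch in plain NumPy, with invented stand-in signals; it shows the general technique, not Agara’s actual pipeline.

```python
# Standard data augmentation sketch: mix recorded noise into clean speech at
# a target signal-to-noise ratio. Stand-in signals; file I/O omitted.
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the given SNR in dB.
    Both inputs are float arrays of the same length."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # scale noise so that speech_power / scaled_noise_power == 10^(snr_db/10)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in speech
noise = rng.normal(0, 0.1, 16000)                           # stand-in noise
noisy = add_noise(clean, noise, snr_db=10.0)
```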


Data Collection & Training

Agara’s tech is built on millions of customer conversations – both proprietary to our clients and generated in-house by Agara.

  • Agara works with independent contractors from around the world to generate data specific to its requirements
  • Contractors are employed from specific parts of the world to generate authentic accent data
  • Contractors are asked to record conversations in their natural surroundings ensuring that natural, real-world noises are included in the data sets

Models are trained on a regular basis looking at areas where improvements are needed. They are also additionally trained to manage the specific needs of clients.

A strong source of data and improvements comes from recorded customer conversations provided to us by our clients. These conversations are the closest data to what the voice bot will encounter and are used in additional training of the models. It is important to note that no client data is ever shared with anyone in any form for any reason.

Delivering Results

All these details matter only if the results are made available in a fast, usable manner. To run the multiple ASR models as well as the subsequent NLU models, Agara does a couple of things:

One, the entity-specific ASRs operate directly on the speech input and do not require language models to create their output. The result is much faster processing than that of public ASRs like Google’s and Amazon’s.

Two, Agara runs its models on GPU machines for fast responses. A high-performance Nvidia Tesla V100 GPU cluster runs the ASR models as well as the NLU models. The average processing time for Agara’s ASR engines (external and entity-specific combined) is ~350 ms (about a third of a second), which makes for a seamless, natural flow in the conversation.



About Agara

Spoken Language Understanding (SLU) from speech directly:

The need for SLU arises because transcribing the customer’s entire speech accurately (using a traditional ASR) is error-prone due to noisy environments, the quality of voice transmission over the phone, etc. More importantly, conversational systems are more interested in intents and entities than in full transcripts. Extracting them directly is also easier, because the set of intents and entity patterns is restricted for a particular domain and industry.

The research we are doing is to build these SLU systems that are specifically tuned to accent, language, and (more importantly) domain/industry. This will allow them to be highly accurate.

One idea is to use ‘speaker embeddings’: internal representations of deep learning models that contain information about speaker characteristics such as accent, gender, etc. When an SLU model learns to use speaker embeddings, it can interpret speech better.
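A conceptual sketch of this idea in PyTorch: the intent classifier consumes acoustic features concatenated with a speaker embedding. The dimensions and toy architecture are assumptions for illustration, not the production model.

```python
# Conceptual sketch: condition an SLU intent classifier on a speaker
# embedding. Dimensions and architecture are illustrative assumptions; a
# real system would derive the embedding from a speaker-encoder network.
import torch
import torch.nn as nn

class SLUWithSpeakerEmbedding(nn.Module):
    def __init__(self, acoustic_dim=256, speaker_dim=64, num_intents=20):
        super().__init__()
        # the classifier sees acoustic features concatenated with the
        # speaker embedding, so it can adapt to accent, gender, etc.
        self.classifier = nn.Sequential(
            nn.Linear(acoustic_dim + speaker_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_intents),
        )

    def forward(self, acoustic_features, speaker_embedding):
        x = torch.cat([acoustic_features, speaker_embedding], dim=-1)
        return self.classifier(x)

model = SLUWithSpeakerEmbedding()
acoustic = torch.randn(1, 256)   # pooled acoustic features for one utterance
speaker = torch.randn(1, 64)     # speaker embedding (accent, gender, ...)
intent_logits = model(acoustic, speaker)
print(intent_logits.shape)       # torch.Size([1, 20])
```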

Another angle is transfer learning for speech, where we leverage pre-trained ASRs, trained on large public speech datasets of available accents (primarily American English), and fine-tune them with a small set of accent-specific English examples (South-East Asian, British, etc.) so that they understand these accents well too.
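A minimal sketch of that fine-tuning recipe, using a stand-in module in place of a real pre-trained ASR: freeze the lower feature-extraction layers and adapt only the upper layers on the small accent-specific set.

```python
# Sketch of accent fine-tuning: freeze the pre-trained encoder, train only
# the upper layers. TinyASR is a stand-in; any pre-trained ASR with named
# submodules would be handled similarly.
import torch.nn as nn
import torch.optim as optim

class TinyASR(nn.Module):  # stand-in for a large pre-trained ASR
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 256)   # pretend: trained on US English
        self.decoder = nn.Linear(256, 40)   # output layer to adapt

pretrained = TinyASR()

# freeze the encoder; only the decoder adapts to the new accent data
for p in pretrained.encoder.parameters():
    p.requires_grad = False

optimizer = optim.Adam(
    (p for p in pretrained.parameters() if p.requires_grad), lr=1e-4
)
```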

Reducing latency is also something we are working on. Using GPUs and parallelization techniques, we want to invoke multiple ASRs / SLU modules simultaneously to extract the various intents and entities.
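As a toy illustration of the parallel-invocation idea (placeholder worker functions, not real model calls): when several modules run concurrently, the wall time approaches that of the slowest module rather than the sum.

```python
# Toy concurrency sketch: run several ASR / SLU modules on the same audio in
# parallel and collect all results. Workers are placeholders for GPU-backed
# model calls.
from concurrent.futures import ThreadPoolExecutor
import time

def run_module(name, audio):
    time.sleep(0.1)              # placeholder for real model inference
    return name, f"<{name} output for {len(audio)} samples>"

audio = [0.0] * 16000
modules = ["generic_asr", "asr_alphanumeric", "slu_intent"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(modules)) as pool:
    results = dict(pool.map(lambda m: run_module(m, audio), modules))
print(results)
print(f"wall time: {time.perf_counter() - start:.2f}s")  # ~0.1s, not ~0.3s
```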

Conversation (text):

Adapting the bot’s response to the context of the conversation, the emotion/mood of the customer, and the customer’s demographics (age, geography, etc.). Current conversational bots are very poor at this, since most are pre-programmed to reply in a static fashion. The broad areas of this research are ‘Controlled Text Generation’ and ‘Style Transfer’.

Engaging the customer in natural conversation, which includes answering questions in the context of the conversation. The broad area of NLP/NLG research this comes under is Question Answering Systems, but our focus is on Conversational Question Answering. This involves answering customer questions based on external knowledge sources such as FAQs, which might not be very structured and will not be written conversationally; the answers given to customers, though, should be conversational and modified to fit the context. An example of work in this area is https://arxiv.org/abs/2006.03533.
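As a small illustration of the retrieval half of this problem, here is a TF-IDF retriever over a toy FAQ using scikit-learn; the conversational rewriting of the retrieved answer, which is the actual research focus, is not shown, and the FAQ entries are invented.

```python
# Minimal retrieval sketch for FAQ-grounded question answering: find the FAQ
# entry most similar to the user's utterance by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = {  # toy, invented FAQ content
    "How do I cancel my order?": "Orders can be cancelled within 24 hours from the orders page.",
    "When will I be refunded?": "Refunds are processed within 5-7 business days.",
}

questions = list(faq.keys())
vectorizer = TfidfVectorizer().fit(questions)
question_vecs = vectorizer.transform(questions)

def retrieve_answer(user_utterance):
    sims = cosine_similarity(vectorizer.transform([user_utterance]), question_vecs)
    return faq[questions[sims.argmax()]]

# Retrieves the refund answer; a conversational layer would then rephrase it
# to fit the dialog context.
print(retrieve_answer("when will i get my refund"))
```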

Reference links:

“Transforming” Delete, Retrieve, Generate Approach for Controlled Text Style Transfer. Akhilesh Sudhakar, Bhargav Upadhyay, Arjun Maheswaran. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. https://www.aclweb.org/anthology/D19-1322/

The Generative Style Transformer: a blog post explaining our paper on style transfer, “Transforming Delete, Retrieve, Generate Approach for Controlled Text Style Transfer”.
