Customer support as a function is driven almost entirely by human interaction and natural language conversation, and thus offers enormous scope for voice technology.

Voice technology implies the automation of voice conversations, wherein a human interacts with a machine or a voice AI assistant at the other end of the line. Human interaction comprises natural language and multiple ways of uttering the same content. People have accents, varied levels of proficiency in English, and often poor audio and network quality while making customer support calls.

With so much natural variation, it is a mammoth task to train computers to understand a human’s natural way of conversing. Recent advancements in deep learning have made it possible to build models that are nearly as good as humans (or even better in a few cases) at understanding a user’s utterances.

Deep learning can be applied to many real-life use cases (with minimal changes) such as classification tasks, entity extraction, sentiment analysis, and summarization, to name a few. However, achieving human-like conversations has proven much more challenging for these models. The deep learning research community frames this as a sequence-to-sequence (seq2seq) problem and has built datasets around it to measure progress in the field.

The latest research shows that deep learning models can be reasonably good at small-talk conversations, but they are far from ready to be applied directly to building voice AI agents. Even the most popular voice AI assistants such as Alexa, Google Home, and Siri work best when given simple commands but are still far from carrying out a natural conversation.

After analyzing hundreds of calls and working on multiple POCs from different industries, we compiled the following list of challenges that must be addressed to create goal-oriented, end-to-end, human-like conversations for voice AI agents.

  1. Ease of updating the conversation flow 
  2. Building new conversations with minimal data
  3. System’s ability to improve the response over time
  4. System’s flexibility to handle user questions
  5. Simplifying complex information

Deep learning-based conversation systems (also known as task/goal-oriented dialogue systems) depend solely on conversation data and are mainly trained with a Maximum Likelihood Estimation (MLE) objective or with Reinforcement Learning (RL). Such systems therefore require retraining even for the slightest change in the conversation, which makes every change very expensive in terms of data preparation and training. Moreover, many conversation responses are based on business logic, which can be complicated to express in many industries (e.g., finance, technology). Current deep learning models are not capable of decoding such logic from text data alone. The robustness of such systems is also questionable, because even models from the latest research have shown unexpected behavior on unseen data.
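To make the MLE objective mentioned above concrete, here is a minimal sketch of the loss such systems minimize: the negative log-likelihood of the reference response given the conversation context, summed token by token. The model here is a hypothetical stand-in (a uniform distribution over a toy vocabulary), not any real dialogue model.

```python
import math

# Hypothetical stand-in for a trained dialogue model: it assigns a
# uniform probability to every token in a tiny vocabulary.
def token_prob(context, prefix, token):
    vocab = ["yes", "no", "your", "flight", "is", "cancelled"]
    return 1.0 / len(vocab) if token in vocab else 1e-9

def mle_loss(context, response_tokens):
    """Negative log-likelihood of the reference response, the quantity
    minimized when training with the MLE objective."""
    nll = 0.0
    for i, tok in enumerate(response_tokens):
        p = token_prob(context, response_tokens[:i], tok)
        nll -= math.log(p)
    return nll

loss = mle_loss("cancel my flight", ["your", "flight", "is", "cancelled"])
```

Because the loss is computed directly against reference conversations, any change to the desired flow means collecting new reference data and retraining, which is the cost discussed above.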

Agara has built an in-house framework keeping the above points in mind. Here are some examples for each challenge.

1. Ease of updating the conversation flow:  

Customer support covers a wide variety of products, each with its own standard operating procedures for resolving issues. These procedures are tweaked very often.

The modifications are usually only a few sentences long but can be placed anywhere in the flow. For example, in the airline industry, updating customers with new flight-status information such as a cancellation due to a COVID case; or in the retail industry, informing customers to expect a shipment delay because of a natural calamity.

The time required to deploy such changes is crucial for a successful voice AI agent. Agara’s framework can modify conversation flows with ease for such smaller changes, as it provides the flexibility to add such messages anywhere in the workflow. Even if a change requires adding some business logic, the framework can accommodate it with a very quick turnaround time.

2. Building new conversations with minimal data: 

In customer support, it is common to build new conversations around existing products, e.g., building a cancellation conversation for retail products or flight tickets where the current conversation supports only inquiries. For deep learning-based systems, one has to generate conversation data for all possible cases, merge it with existing data, and retrain the dialogue manager. This is a very time-consuming and expensive practice, and surely not scalable.

Agara’s framework has been developed so that such conversations can be built quickly from logical workflows with minimal conversation data. For a quick start, the AI agent’s responses are tightly coupled with the intents and entities extracted from the user’s utterance and the current state of the conversation. Once deployed, it offers the basic functionality with high robustness.
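The coupling of responses to intents, entities, and conversation state described above can be sketched as a simple lookup-driven dialogue manager. All names and the table contents here are hypothetical illustrations, not Agara’s actual framework API.

```python
# (current_state, intent) -> (next_state, response template)
RESPONSES = {
    ("start", "cancel_booking"): (
        "collect_details", "Sure, may I have your booking ID?"),
    ("collect_details", "provide_booking_id"): (
        "confirm", "I found booking {booking_id}. Shall I cancel it?"),
    ("confirm", "affirm"): (
        "done", "Your booking has been cancelled."),
}

def respond(state, intent, entities):
    """Pick the next state and response from the extracted intent,
    entities, and the current conversation state."""
    next_state, template = RESPONSES.get(
        (state, intent),
        (state, "Sorry, could you rephrase that?"))  # fallback reply
    return next_state, template.format(**entities)

state, reply = respond("start", "cancel_booking", {})
state, reply = respond(state, "provide_booking_id", {"booking_id": "AB123"})
```

Because each response is keyed on an explicit state and intent, adding or changing a response means editing one table entry rather than retraining a model, which is what makes a quick start with minimal data possible.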

As Agara has exposure to multiple clients across industries, basic functionalities like cancellation already exist in a generic form that can simply be reused by providing use-case-specific parameters such as company name, cancellation policies, and data APIs.
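One way to picture this reuse is a generic flow builder that is specialized with use-case-specific parameters. The function and parameter names below are hypothetical, meant only to illustrate the idea of parameterizing an existing cancellation flow.

```python
def build_cancellation_flow(company, policy_text, booking_api):
    """Specialize a generic cancellation flow with client-specific
    values: company name, cancellation policy, and a data API."""
    return {
        "greeting": f"Thank you for calling {company}.",
        "policy": policy_text,
        "lookup_booking": booking_api,  # callable used mid-conversation
    }

# Hypothetical client onboarding: only the parameters change,
# the flow logic itself is reused as-is.
flow = build_cancellation_flow(
    company="Acme Air",
    policy_text="Tickets are refundable up to 24 hours before departure.",
    booking_api=lambda booking_id: {"status": "confirmed"},
)
```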

3. System’s ability to improve the response over time: 

Just as humans gain experience at tasks they perform regularly and learn to handle more complexity with ease, the same is expected from a voice AI agent. It should be able to provide responses that not only make logical sense but are also tailored to the user on the call.

For example, in the finance domain, users can have varying degrees of knowledge, so for the same question, the AI agent’s answer should be conditioned on the user’s proficiency in the domain. The same applies to other user characteristics such as sentiment, age, and region.

One of the most important ways to achieve this is through controlled natural language generation. Agara has already published work in this area at EMNLP 2019 and has filed a patent on a method that achieves state-of-the-art results.

4. System’s flexibility to handle user questions: 

In a customer support call, voice AI agents guide the caller based on the purpose of the call. For example, if a user has called to cancel a flight ticket, the voice AI agent collects booking details, confirms them against the database, and informs the user of the refund amount. But users can have questions along the way, for instance about refund policies. An efficient voice AI should be able to answer all such questions if the information is available to it, or point users to where they can find it.

Agara’s framework has the capability to store the current state of the conversation and answer such questions by calling an API or using unstructured domain knowledge. Once the question is answered, it resumes the conversation from where it left off.
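The pause-and-resume behavior described above can be sketched with a simple stack of suspended conversation states: push the current state when a side question arrives, answer it, then pop and continue. The class, the crude question detector, and the canned answer are all hypothetical simplifications.

```python
class Conversation:
    def __init__(self):
        self.stack = []                        # suspended states
        self.state = "collect_booking_details" # current main-flow step

    def handle(self, utterance):
        if utterance.endswith("?"):        # crude stand-in for a
            self.stack.append(self.state)  # question classifier
            self.state = "answer_question"
            answer = self.lookup(utterance)
            self.state = self.stack.pop()  # resume where we left off
            return answer + " Now, back to your cancellation."
        return f"(continuing '{self.state}')"

    def lookup(self, question):
        # Stand-in for an API call or unstructured-knowledge search.
        return "Refunds are processed within 7 business days."

conv = Conversation()
reply = conv.handle("What is your refund policy?")
```

After the digression, `conv.state` is back at the original step, so the main flow continues exactly where it was interrupted.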

5. Simplifying complex information: 

Tasks such as booking a flight or buying something online take more time because the user has to go through many options, compare them, and make a final decision. For a great user experience, it is important to have proper mechanisms to present such complex information in a simple way; naive template-based methods can create a bad user experience. For example, in the case of flight booking, there can be multiple flights available for a given date and city pair, and simply reading out all the options leads to a very poor user experience.

Agara’s framework has been diligently designed for such cases, with conversations built around them so that users can easily get what they are looking for.
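One simple way to avoid reading out every option, in the spirit of the flight-booking example above, is to rank the options and present only the most relevant few. The flight data and the ranking criteria (fewest stops, then lowest price) are illustrative assumptions, not Agara’s actual logic.

```python
# Illustrative flight options for a given date and city pair.
flights = [
    {"id": "AI101", "price": 220, "stops": 1, "departs": "06:30"},
    {"id": "AI202", "price": 180, "stops": 0, "departs": "09:15"},
    {"id": "AI303", "price": 150, "stops": 2, "departs": "14:00"},
    {"id": "AI404", "price": 210, "stops": 0, "departs": "18:45"},
]

def summarise_options(options, top_n=2):
    """Rank options (fewer stops, then lower price) and read out
    only the best few instead of the full list."""
    ranked = sorted(options, key=lambda f: (f["stops"], f["price"]))
    lines = [f'{f["id"]} departing {f["departs"]} at ${f["price"]}'
             for f in ranked[:top_n]]
    return ("I found {} flights. The best options are: {}."
            .format(len(options), "; ".join(lines)))

prompt = summarise_options(flights)
```

A follow-up turn could then offer the remaining options only if the user asks, keeping each spoken response short.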

Agara’s conversation framework for voice AI has been designed with a few principles in mind: ease of use, minimal data requirements to start with, ease of modification, a great user experience, and continuously robust responses. To achieve these, we not only explore and build state-of-the-art technologies but also understand their limitations against our design goals. We also invest heavily in applied research that helps us improve the system’s performance and be the best in each design area.