Customers send emails to companies all the time. An email might contain questions about a product, a complaint about a recent product failure, a request for coupons, and so on. These emails are then handled by customer service agents on the company’s side, who read the email, collect relevant information from internal knowledge bases, and then respond to the customer. Having the important parts of the email highlighted for them would make the agents a lot more efficient and able to provide higher quality service with faster turnaround times.

At Agara Labs, we had a dataset of historical customer service emails. We also had data on the type of each email, for instance a question about a product vs a complaint about a product. These categories were provided by the agents themselves when handling the case, so they were very accurate. With this dataset in hand, we sought a way to add highlighting of the important parts of an email as a feature of our product.

Given just the email text and no other information, we wanted to highlight the important words/phrases in the email for the agents.

[Figure: example email, based on a random review from Amazon]

What constitutes an important word/phrase is subjective and depends on the agent and the type of product. A bad user experience might cause the agents to ignore our highlighting, rendering it useless. So we wanted the system to have a few important features:

  • The ability to tune the system easily on our end to suit the agent’s needs, for instance by highlighting more or less of the email
  • The highlighting needed to be fluid and smooth with enough context, more like summarising the important parts than highlighting a word or two here and there
  • The system needed to have very high recall with reasonably high precision. If we missed important parts, the agents would not be able to rely on our system and might even start ignoring the highlights. On the other hand, if we highlighted a considerable number of unimportant words, especially outside the context of an important part of the email, the agents might get frustrated and start ignoring the highlights

We started exploring the problem by using a couple of simple count-based methods such as TF-IDF and log-likelihood-ratio. The highlights produced were a little underwhelming. Moreover, any purely count-based method would not be able to generalize to words/phrases not seen in our historical dataset of emails. However, these methods served as good baselines as we built better solutions. The rest of the post describes an approach which worked very well for our use case.
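To make the baseline concrete, here is a minimal TF-IDF sketch. It is an illustration only, not the code used at Agara Labs: the function name `tfidf_scores`, the smoothed IDF formula and the toy corpus are all assumptions.

```python
import math
from collections import Counter

def tfidf_scores(email_tokens, corpus):
    """Score each distinct word of one email by TF-IDF against a corpus.

    `corpus` is a list of tokenized historical emails (hypothetical data).
    """
    n_docs = len(corpus)
    # Document frequency: in how many corpus emails does each word appear?
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    term_freq = Counter(email_tokens)
    scores = {}
    for word, count in term_freq.items():
        tf = count / len(email_tokens)
        # Smoothed IDF (an assumed variant) so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        scores[word] = tf * idf
    return scores

corpus = [
    ["the", "blender", "broke"],
    ["the", "coupon", "expired"],
    ["the", "order", "arrived"],
]
scores = tfidf_scores(["the", "blender", "broke"], corpus)
# Words would be highlighted in decreasing order of score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Note that the score depends only on counts: two words with similar meaning but different spellings share nothing, which is the generalization problem described above.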

Segment Embedding

We split the emails into smaller parts which we called segments. Each segment was typically an n-gram (say n = 1 to 3) with a few small constraints. One constraint, for instance, was that a segment cannot span more than one sentence.
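A rough sketch of this kind of segmentation is given here. The naive sentence splitter, the tokenizer regex and the `(start, end, words)` output format are assumptions made for illustration; only the n-gram range and the one-sentence constraint come from the description above.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace (assumption).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def segments(text, max_n=3):
    """Return (start, end, words) n-gram segments that never cross a sentence.

    start/end are word offsets into the whole email, end exclusive.
    """
    result = []
    offset = 0
    for sentence in split_sentences(text):
        words = re.findall(r"[A-Za-z0-9']+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                result.append((offset + i, offset + i + n, tuple(words[i:i + n])))
        offset += len(words)
    return result

for start, end, words in segments("It broke. Please help me.", max_n=2):
    print(start, end, " ".join(words))
```

Because n-grams are generated sentence by sentence, a segment such as "broke Please" is never produced.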

The idea is to map each of these segments to an embedding space and figure out which segment embeddings are most predictive of the category of the email. For mapping the segments to an embedding, we used pre-trained word embeddings. Pre-trained word embeddings such as GloVe/Word2Vec let us generalize more easily to words we haven’t seen in our historical dataset. Moreover, because these embeddings have been trained on billions of data points, they immediately add information on word similarity to the model, which helps prediction performance.

The embeddings of individual words are averaged to get the segment embedding.

To get the segment-level embedding, we simply averaged the word embeddings of the individual words in the segment. Since the arithmetic mean is a summary statistic, the average embedding acts as a nice summary of the segment. The mean word embedding usually works well for short texts, especially when the order of the words isn’t important to the prediction problem. Both were the case with the email segments, so this worked well for us.
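The averaging step can be sketched in a few lines. The two-dimensional toy vectors and the choice to skip out-of-vocabulary words are assumptions for illustration; real pre-trained embeddings have far more dimensions.

```python
def segment_embedding(words, word_vectors, dim):
    """Average the vectors of the words in a segment.

    Words missing from `word_vectors` are skipped (an assumed policy);
    a segment with no known words maps to the zero vector.
    """
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

# Toy 2-dimensional "embeddings", made up for this example.
toy_vectors = {"blender": [1.0, 0.0], "broke": [0.0, 1.0]}
print(segment_embedding(("blender", "broke"), toy_vectors, dim=2))
```

The mean ignores word order, so ("blender", "broke") and ("broke", "blender") map to the same embedding.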

We considered each segment in the email separately and obtained the segment embeddings for them. These were then passed as input to a Multi-Layer Perceptron which consists of a couple of hidden layers with ReLU activation, Dropout for regularization and a final Softmax layer. We trained this model to predict the category of the email and also fine-tuned the word embeddings on this task. The segments which are most predictive of the category of the email are likely to be important ones.
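For readers who want to see the shape of such a network, here is a forward pass written in plain Python. It is a sketch under stated assumptions: the layer sizes and weights are toy values, training (backpropagation and fine-tuning of the embeddings) is omitted, and the dropout shown is the common "inverted" variant, which may differ from what was actually used.

```python
import math
import random

def relu(x):
    return [max(0.0, v) for v in x]

def dense(x, weights, bias):
    # `weights` holds one row per output unit.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def dropout(x, rate, rng):
    # Inverted dropout: zero each unit with probability `rate`, rescale the rest.
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, layers, dropout_rate=0.0, rng=None):
    """`layers` is a list of (weights, bias); the last pair feeds the softmax.

    Dropout is applied only when an `rng` is passed (training mode).
    """
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
        if rng is not None and dropout_rate > 0.0:
            x = dropout(x, dropout_rate, rng)
    weights, bias = layers[-1]
    return softmax(dense(x, weights, bias))

# Toy network: 2 inputs -> 2 hidden units -> 2 categories (made-up weights).
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
]
print(mlp_forward([1.0, 0.0], layers))  # inference
print(mlp_forward([1.0, 0.0], layers, dropout_rate=0.5, rng=random.Random(0)))  # training mode
```

The input `x` is a segment embedding and the output is one probability per email category.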

Highlighting Segments

When we get a new email that we want to extract important words from, we run each individual segment in the email through the trained model. From the model’s prediction output, we take the maximum prediction score across all the categories for each segment, sort the segments by this score, and select the segments whose score is above a predetermined threshold.
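The selection step could look roughly like this. The function name, the dictionary input format and the threshold value of 0.7 are hypothetical; the per-segment probabilities would come from the trained model.

```python
def select_highlights(segment_probs, threshold):
    """Pick segments whose best category score exceeds `threshold`.

    `segment_probs` maps a segment (here a (start, end) word span) to the
    model's per-category probabilities for that segment.
    """
    scored = [(max(probs), seg) for seg, probs in segment_probs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for score, seg in scored if score > threshold]

# Made-up model outputs for three segments and two categories.
segment_probs = {
    (0, 2): [0.9, 0.1],
    (2, 4): [0.5, 0.5],
    (4, 6): [0.2, 0.8],
}
print(select_highlights(segment_probs, threshold=0.7))
```

In a setup like this, the threshold is the natural tuning knob: lowering it highlights more of the email, raising it highlights less.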

One thing that was a little surprising to us was that, without being explicitly trained for it, the highlighting produced by the model was quite smooth. It wasn’t just highlighting words here and there. For instance, two individual segments were quite likely to form a contiguous smooth highlight. We also found that bi-grams were the most effective in producing this sort of smooth highlighting.

Even though the highlights were already quite smooth, they were still rough around the edges. So we came up with quite a few rules to smooth out the highlighting even further. For instance, if there were two highlighted segments with just one un-highlighted word in between, then we probably want to highlight that word as well, so the result doesn’t look cluttered.
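The single-word gap rule mentioned above is simple enough to sketch. Representing the highlights as one boolean per word is an assumption, and the sketch ignores sentence boundaries, which a real rule would probably respect.

```python
def fill_single_gaps(highlighted):
    """Highlight any single word that sits between two highlighted words.

    `highlighted` has one boolean per word of the email. Neighbours are read
    from the original list, so filling one gap never cascades into another.
    """
    smoothed = list(highlighted)
    for i in range(1, len(highlighted) - 1):
        if not highlighted[i] and highlighted[i - 1] and highlighted[i + 1]:
            smoothed[i] = True
    return smoothed

before = [True, True, False, True, False, False, True]
print(fill_single_gaps(before))
```

In this example the one-word gap at index 2 is filled, while the two-word gap at indexes 4 and 5 is left alone.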

With the combination of the model and hand-written rules, the system was able to produce high-quality highlights that were not immediately distinguishable from human-produced ones for the majority of the cases.

Few things …

While so far we have only talked about the category (complaint vs question) the email belonged to, we actually had data on other dimensions as well, such as which product was mentioned. Using these extra dimensions helped produce much richer highlights.

We also used a few regular expressions and hand-written rules to extract customer information like name, phone, email, location etc.
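As an illustration of the regular-expression side, here is a small sketch for two of those fields. The patterns are deliberately simple assumptions, not the rules we used; real phone and email formats are messier, and names and locations need more than a regex.

```python
import re

# Simplistic, illustrative patterns (assumptions).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contact_info(text):
    """Return the email addresses and phone-like strings found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

print(extract_contact_info("Reach me at jane.doe@example.com or 555-123-4567."))
```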

While segmenting the emails and averaging word embeddings to get segment embeddings has worked very well, it is only one way of doing this. Using a recurrent architecture, such as an LSTM, coupled with an attention mechanism where the attention weights indicate the importance of words is another possible way. We hope to explore this in a future post.

Luckily, we didn’t have to shut down any rogue AI when we were building the model.