Conversational data has a unique structure—call logs and transcripts are hierarchical, with each statement relying on the order and content of the ones before. In this talk, we discuss ways to create effective call transcript classifiers that leverage automated data labeling and synthetic data generation.
Jaxon’s approach to handling call transcript data involves a 3-component neural model:
- Text representation
- Attention layer (coordinate different utterances in a single dialogue)
Also, we show how Jaxon can help you create models faster with techniques like unsupervised data augmentation—without compromising the model’s integrity.
All right. Let’s get this kicked off. So we’ve been building Jaxon over the past five years or so, and as new whiz-bang techniques have come out, we’ve incorporated them. So you can think of us as a multipurpose training harness that ties a bunch of different techniques together into a pipeline that is operationalized.
Operationalized, meaning that your data scientists don’t need to cobble these algorithms together in one-off notebooks that aren’t reusable. We have all these newer techniques in a production-ready system. Techniques like augmenting data, which I know Brad will talk a lot more about. Techniques like transfer learning, where you carry over what other models have learned through pre-trained, what are called embeddings, or data, if you will.
We also have embraced active learning, which optimizes human time. If a human is going to be spending time working on a model, we wanna make sure that we are aware of the cost of that human’s time and direct them to the highest and best use of their time, which could be labeling some examples to seed Jaxon.
It could be accuracy voting on examples that were labeled by Jaxon. It also could mean writing rules into the equation that Jaxon can use, all toward the end of creating this training data on the right-hand side here.
Thanks, Scott. So this webinar is about how to handle conversation-type data, and conversation-type data is unique in its own right.
It’s a little bit different than just trying to take a chunk of text and classify it, because it comes in a different kind of structure. For example, it is a sequence of sequences. And what’s so special about that is that each line here in this conversation is a sequence where the words are in a particular order. But it’s not just the words in each line: each line, each utterance, in this conversation is also in a particular order. If we had all of Frank’s utterances first, and then all of George’s utterances last, the conversation probably wouldn’t make much sense. And so there’s a hierarchical structure to this data, and it’s important to take advantage of that hierarchical structure.
But conversations and call logs and transcripts are not the only things that come in this structure. You might think about some of the other types of data that you have. Network data, for example, is also organized in a hierarchical structure. When you have two machines that are talking to each other, they send data packets to and fro, and a collection of those data packets is called a session.
And you can think of each data packet as being an utterance. All the data in there is in a structured format, and how machines start talking to each other is also structured. Typically it’s the workstation that starts to talk to the server, instead of the server initiating the conversation with the workstation.
And we did a project on this to try to find nefarious or just weird conversations happening on the network, and we had a lot of success in modeling it as a dialogue. As for the ways that people approach tackling this problem, the first is that you can just take every utterance and concatenate it all.
And a lot of times we have some kind of token or separator pattern that indicates that this is the end of one utterance and the start of a new one. And here, it’s just a made-up pattern that we use to tell the model when an utterance has stopped and when one has started.
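The concatenation approach can be sketched in a few lines of Python. This is a minimal illustration, not Jaxon's code: the separator string and the function names `join_utterances` and `split_utterances` are assumptions made up for the example.

```python
# Hypothetical separator. The talk only says it is a made-up pattern
# that is unlikely to show up in real text.
SEPARATOR = " ||| "

def join_utterances(utterances, separator=SEPARATOR):
    """Concatenate a conversation's utterances into one string."""
    return separator.join(u.strip() for u in utterances)

def split_utterances(text, separator=SEPARATOR):
    """Recover the individual utterances from a concatenated conversation."""
    return [u.strip() for u in text.split(separator.strip()) if u.strip()]

conversation = [
    "Frank: Hi, I'd like to book a trip.",
    "George: Sure, where would you like to go?",
    "Frank: Somewhere warm.",
]
flat = join_utterances(conversation)   # one string, order preserved
restored = split_utterances(flat)      # back to a list of utterances
```

Because the separator never occurs inside an utterance, the round trip is lossless, which is why an unusual pattern is a sensible choice.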
But what happens is that the model needs to turn this entire concatenated conversation into some kind of representation. And so this whole conversation gets one representation. Another way of looking at this is keeping each utterance separate and instead having the model learn how to combine all those different representations. So once it turns the text into numbers, so that the computer can understand it and the model can use it, we have to figure out a way to combine all of these, and we can do that with something called an attention layer.
And the attention layer is going to learn how to value each of these representations, each of these representations in pink. And so it’s gonna assess them one at a time and indicate how important each one is to the final task, which here is just trying to classify this as a positive or negative conversation.
Or if it’s a call log transcript, we’re gonna classify it as: the customer was satisfied with the conversation, or was dissatisfied, or there are different levels of satisfaction. And the special thing about this attention mechanism is that it’s learned.
So if each line in this conversation was of equal importance to the final task, then this attention mechanism would weight them all the same. And we would have what is essentially the mean of all of these representations to combine them into one final conversation representation. But what the attention layer does is it actually can learn how valuable each of these is.
And then what we do is we take a weighted sum of all of those representations so that we have a final conversation representation, one representation to rule them all. So now we’re placing more weight on areas of the conversation that might be more important, which is a really fun idea, that this can be learned.
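The weighted-sum idea can be shown with toy numbers. This is a sketch under stated assumptions: the utterance vectors and scores below are made up, and in a real attention layer the scores are produced by learned parameters rather than written by hand.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(utterance_vectors, scores):
    """Weighted sum of utterance vectors, with weights from softmax(scores)."""
    weights = softmax(scores)
    dim = len(utterance_vectors[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, utterance_vectors))
        for i in range(dim)
    ]
    return pooled, weights

# Toy 2-dimensional utterance representations (made-up numbers).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal weights, so the pooled vector is the plain mean.
mean_pooled, uniform_weights = attention_pool(vectors, [0.0, 0.0, 0.0])

# Hand-picked scores that make the third utterance dominate.
focused_pooled, focused_weights = attention_pool(vectors, [0.0, 0.0, 3.0])
```

The returned weights are also what you would inspect to see which utterances the model leaned on for a given prediction.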
It also kind of makes the black box of a neural network a little more gray, or a little less opaque, because you can look at these weights and interrogate the model and say, “what did you focus on in this conversation the most in order to give us your prediction?”
And I should mention, we’re gonna look at Jaxon, and we’re gonna look at how this all fits in Jaxon. Each of these things that we’ve described, we do it, but we also have a special-sauce version of it. And so let me take you to what we’re talking about.
I just wanna share with those that don’t really understand attention layers and what happens inside the neural network. I like to dumb it down to just thinking about it as math. And in between each node of a neural network, you can think of the math happening in the form of a weight or a bias.
And the attention layers are looking at parts of the neural network that are specific to a particular task at hand, and the attention between the neural nodes is how it processes the data and makes the ultimate decision of, is this what I’m looking at or not… the label, if you will.
So here is Jaxon, and we have several tabs up here. We’re probably gonna go over a few of these tabs very briefly, and we’re gonna focus mostly on this neural tab, which is where I spend my entire working day. But right here is the projects tab, and maybe I can even zoom in just a little bit. And there we go.
So this is the projects tab, and this is where we just start a new project. We like to start a new project whenever we have a different data set and want to keep our models separate.
This one might be about dialogue and conversations. The next one might be about just classifying recipes. We can start the project here. Once we start it, we go to our data set and import our data, and the specification tab lets us pick our classes and make sure that the classes are represented for our classification task.
And that traces back to this dataset tab. The special thing about this is that once we import a data set, you can see that in these examples I use the same separator as I did in the slide presentation part. And that’s gonna become important later, because there are so many different ways, and so many different data sets, that try to express how a conversation flows.
They can be in JSON format, they can be XML, they can be whatever. So it all needs to be concatenated with some kind of separator. And I just chose this one because… I figured that pattern wouldn’t show up. And it’s funny how it caught a lot of my colleagues by surprise. Like, what is this thing?
So you can see the separators. Also, I only had one data set when I first ingested this. I just had this original data set, and what I did is I went to the split dataset tab. And you can randomly split your data set into a train and test, or even a validation, data set.
Or we have something that we call SmartSplit, which intelligently tries to ensure that your train set distribution, not just the frequency of positive and negative calls but also the type of positives and the type of negatives, is evenly matched between the train and test. And in doing so, we were just hoping to reduce the variance of what a model produces.
And when the two distributions are the same, it did: it drastically reduced the variance. But it also increased the accuracy that we were looking for, the mean accuracy, which was a nice little bonus.
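The talk does not describe how SmartSplit works internally, so the following is only a plain stratified split, a simpler stand-in that illustrates the goal of keeping train and test distributions matched. The function name `stratified_split` and the "kind" field (a label for the type of call) are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(examples, key, test_fraction=0.2, seed=0):
    """Split examples so each group defined by key(example) appears in
    train and test in roughly the same proportion."""
    groups = defaultdict(list)
    for example in examples:
        groups[key(example)].append(example)

    rng = random.Random(seed)
    train, test = [], []
    for members in groups.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy data: (label, kind) pairs. "kind" is a hypothetical stand-in for
# the type of positive or negative call.
data = (
    [("positive", "billing")] * 10
    + [("positive", "booking")] * 10
    + [("negative", "billing")] * 10
    + [("negative", "booking")] * 10
)
# Stratify on the full (label, kind) pair, not just on the label.
train, test = stratified_split(data, key=lambda ex: ex)
```

Stratifying on the label alone would balance positives and negatives but could still put all the "billing" negatives in one split, which is the kind of mismatch the talk is describing.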
So we can split our data set. And then the labels tab is a collaboration tab for when you have unlabeled data, as we do here in our original. Any of these “None”s is just an unlabeled data piece.
And in the labels tab, this is a place of collaboration where everyone can choose labels. We can even vote on labels. Uh, you can just really geek out about how one data piece should be defined, because that can have a drastic influence on your model accuracy. Scott mentioned—
Yeah, this is, I was gonna say, I mentioned during my spiel, this is where the human time is really spent, and you wanna optimize their time by picking the best task at hand, which is what we’re calling guided learning.
One more thing I wanted to clarify on—when it comes to the problem specification, in this particular example, we have two classes, positive or negative. But when a data scientist is sitting down to design these models, the classes could be, well, not endless, but it could be in the dozens or hundreds of different classes that you want to have Jaxon figure out what label to apply to.
Scott mentioned earlier that you can even write a function or a RegEx expression to be its own classifier. And so if we had recipes, we might write a RegEx expression here that says, any time the recipe says salsa, I want you to classify it as Mexican.
Now, these heuristics, these rules that we write for a classifier, can all be used and evaluated on their efficacy by an underlying technology called Snorkel. So Snorkel will take many models and those models’ predictions, and it will learn which one to pay attention to and which one to ignore based on the input example.
So you can basically write your own model, which is just a RegEx pattern looking for a specific class.
I think an important point to drive home is that rules of yore, or heuristics that people have used in the past, have been steadfast rules that you have to listen to.
With this approach, they are only used when they’re relevant, and the ability to abstain is really key to the operation.
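Here is a minimal sketch of rules that can abstain, using the salsa example from the talk. It deliberately does not use Snorkel's API: the `vote` function is a simplified stand-in that counts every rule equally, whereas Snorkel learns how much to trust each one. The second rule and all function names are assumptions added for illustration.

```python
import re
from collections import Counter

ABSTAIN = None  # a rule returns this when it has nothing to say

def rule_salsa(text):
    """Vote 'Mexican' when the recipe mentions salsa, otherwise abstain."""
    return "Mexican" if re.search(r"\bsalsa\b", text, re.IGNORECASE) else ABSTAIN

def rule_soy_sauce(text):
    """Hypothetical second rule, included only to show several rules voting."""
    return "Asian" if re.search(r"\bsoy sauce\b", text, re.IGNORECASE) else ABSTAIN

def vote(text, rules):
    """Majority vote over the rules that did not abstain.

    Simplified stand-in: every rule counts equally here, while Snorkel
    learns which rules to pay attention to.
    """
    votes = [label for label in (rule(text) for rule in rules) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

rules = [rule_salsa, rule_soy_sauce]
label_a = vote("Top the tacos with fresh salsa and lime.", rules)
label_b = vote("Whisk flour, eggs, and milk into a batter.", rules)
```

The first recipe gets a label because one rule fires; the second gets none because every rule abstains, so it is left for other models or a human to label.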
So then there’s also this tab of classical models. And this is your random forest, your support vector machine. And we are not of the opinion that neural networks are always the best choice.
There is no free lunch here. So we have a classical tab that, for those occasions, now, it is exactly—
We’ve found that neural usually rules the day, but there are some datasets and some use cases where classical approaches are still valid, and we’re gonna marry them all together anyway in the ensemble tab.
So it just has a different view on the same data that the neural networks might have missed or looked at a different way.
Yeah, and I was obligated to say that neural networks are not always the best, because actually the neural tab is where I spend my entire day. This is where we can run different neural models, and soon, I hope it’s okay that I say this, you’ll be able to ingest your own neural model and have Jaxon around it to really amplify its effectiveness.
And one of the ways that we, so here’s a neural model, model one.
Let me just add to that or clarify. Right now we’re building models inside of Jaxon, and we have supported a number of different architectures that are in the open domain. What Brad just said is that coming soon you’ll be able to create a model outside of Jaxon and bring it to Jaxon for training.
Yep. And here we’re just gonna pick our test set and the models. We have a couple of recurrent neural networks. This AWD-LSTM is a Fast.AI product; it’s their LSTM. And we also have DistilBERT, which is a smaller version of the giant model, BERT. And so we select the models, and you’ll notice that as soon as I selected DistilBERT, we have this new option down here, which is “augment data”.
And we’ll talk about that in just a second. But I’m just gonna set this up. We have uh, we’re gonna do on the train… and maybe I’ll zoom out just a little bit…
So we can see that if you had multiple data sets, you can create multiple stages of training. (This would definitely be cheating, ‘cuz the test set is from the original.)
And if I had a group, there’s even a pre-train. We can even pre-train it and start off with that stage first, and then go to a training stage. Sometimes this is very valuable, especially when you have time-ordered data and you know that more recent data is more valuable than past data.
And so you wanna train on the past data first, and that’ll be one of the first stages. And then you train on the more recent data last, ‘cuz that’s more like your current data. Sometimes our data suffers from data drift. Like the stock market: Apple’s valuation, the price of their stock, is much different in the year 2000 than it is right now.
And so we have these different training stages that we can do. Or, once we can ingest a model, if you wanted to take an old model and train it on new data, this is where that could happen.
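The staging idea for time-ordered data can be sketched as follows. This only shows how examples might be grouped into an older and a more recent stage; the record layout, the cutoff date, and the name `make_stages` are assumptions, and no actual training is performed.

```python
from datetime import date

def make_stages(examples, cutoff):
    """Split time-stamped examples into an older stage and a recent stage.

    Training on the older stage first and the recent stage last lets the
    most recent data have the final influence on the model.
    """
    ordered = sorted(examples, key=lambda ex: ex["date"])
    older = [ex for ex in ordered if ex["date"] < cutoff]
    recent = [ex for ex in ordered if ex["date"] >= cutoff]
    return [("older", older), ("recent", recent)]

# Made-up records; only the dates matter for this sketch.
examples = [
    {"text": "call C", "date": date(2021, 6, 1)},
    {"text": "call A", "date": date(2015, 3, 9)},
    {"text": "call B", "date": date(2018, 11, 20)},
]
stages = make_stages(examples, cutoff=date(2020, 1, 1))

for name, stage_examples in stages:
    # A real pipeline would run its training routine here, one stage at a time.
    print(name, [ex["text"] for ex in stage_examples])
```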
So I like to call these the levers to pull within Jaxon, or…
An aspect of Jaxon that I don’t think we’ve mentioned yet is that we really embrace an iterative approach to model creation, where you’ll try one pass here, maybe without augmentation, maybe without pre-training. Then you can come back through, and just by selecting or deselecting these options, you can try out different approaches with the same data and maybe even the same neural architecture. But… by iterating, you can keep improving and finding the balance of precision and recall that you’re really going after.
And some of the general model levers that you can play with: the learning rate, weight decay.
We have this labeled as long form document handling, but this is where we handle conversation-type data, and we assume that there is some kind of pattern that is separating each utterance or each line in a conversation.
And we call it “long form” because we can also separate an entire document into its paragraphs, and then we can combine the information from those paragraphs later through that attention mechanism. It’s exciting because one of the attention mechanisms that we have inside of Jaxon is based on entropy, using entropy explicitly, not just to update the model, but also to evaluate where most of the information might be in a conversation.
And we just take out this pattern and then we put in our beautiful “make sense” pattern here of pipes and special characters. And we say that’s what’s separating each of these. Now, the data set that I have loaded right now is the Frames data set.
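The talk does not explain how Jaxon's entropy-based attention is computed, so the sketch below is not that mechanism. It only illustrates one hypothetical way entropy could rank utterances: compute the Shannon entropy of each utterance's predicted class probabilities and give more weight to the utterances the model is more certain about. The weighting scheme, the names, and the numbers are all assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: 0 is fully confident, higher is less certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_weights(per_utterance_probs):
    """Hypothetical scheme: weight each utterance by how far its prediction
    is from maximum uncertainty, then normalize the weights to sum to 1."""
    n_classes = len(per_utterance_probs[0])
    max_entropy = math.log2(n_classes)
    raw = [max(0.0, max_entropy - entropy(p)) for p in per_utterance_probs]
    total = sum(raw)
    if total == 0:
        # Every utterance is maximally uncertain: fall back to equal weights.
        return [1 / len(raw)] * len(raw)
    return [r / total for r in raw]

# Made-up class probabilities (positive, negative) for three utterances.
probs = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
weights = entropy_weights(probs)
```

In this toy case the coin-flip utterance gets zero weight and the confident one gets most of it.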
And it’s small. And it’s small because it’s severely imbalanced. So the number of positive reviews from customers of an interaction with a call agent is much greater than the negative. So I balanced it out, but in doing so, I turned a lot of the positive interactions into unlabeled.
And so one way we can handle a smaller data set, if we have a lot of unlabeled data (and the key is having a lot of unlabeled data, which is usually very inexpensive to get), is this little checkbox. And that checkbox does not do this justice, because what the augmented data set does behind the scenes, and we won’t go into too much detail, is unsupervised data augmentation.
So the supervised training is happening with the labeled data and the model as one part of the training. But there’s also an auxiliary task happening: for each unlabeled piece of data, we alter it and create this synthetic, almost like, clone. A Michael Keaton movie’s coming to mind. So we have the original unlabeled data, and we have a copy of it that we’ve distorted a little bit. You can think of those distortions, synthetically generating a new example based off of the original, as something like random word replacement. Maybe we just randomly replace 20% of the words with other words in the dictionary.
That’s the most basic way of augmenting text. It’s not super effective for the supervised side, but for the unsupervised side it makes a drastic difference, ‘cuz all we care about is that the model takes that pair, the unlabeled example and its synthetic augmentation, and predicts the same probabilities for both.
We don’t care if the probabilities are correct; we just want it to predict the same probability. And this is all unsupervised. So we have two things going, the supervised part and the unsupervised part, making the model much, much more accurate. So accurate that we can take the IMDB data set and use only 20 examples.
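The unsupervised half of this setup can be sketched as a consistency loss between an unlabeled example and its augmented copy. This is a toy illustration, not Jaxon's implementation: `toy_predict` is a made-up stand-in for a real classifier, the vocabulary is invented, and KL divergence is used here as one common way to measure disagreement between two probability distributions.

```python
import math
import random

def augment(text, vocabulary, replace_fraction=0.2, seed=None):
    """Randomly replace a fraction of the words with words from a vocabulary."""
    rng = random.Random(seed)
    words = text.split()
    n_replace = max(1, round(len(words) * replace_fraction))
    for index in rng.sample(range(len(words)), n_replace):
        words[index] = rng.choice(vocabulary)
    return " ".join(words)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): 0 when the distributions match, larger as they diverge."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(predict, text, augmented_text):
    """Penalize different class probabilities for an unlabeled example and
    its augmented copy. No label is needed."""
    return kl_divergence(predict(text), predict(augmented_text))

def toy_predict(text):
    """Stand-in classifier: the 'positive' probability grows with the share
    of words equal to 'great'. Purely illustrative."""
    words = text.split()
    positive = (words.count("great") + 1) / (len(words) + 2)
    return [positive, 1 - positive]

vocabulary = ["flight", "hotel", "great", "late", "agent"]
original = "the agent was great and the booking was quick"
augmented = augment(original, vocabulary, seed=0)
loss = consistency_loss(toy_predict, original, augmented)
```

During training this term would be added to the ordinary supervised loss on the labeled examples, so the labeled data teaches the model what is correct and the unlabeled pairs teach it to be stable.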
And the movie review data set is trying to classify movie reviews as either positive or negative sentiment. And we can just take 20 examples and get about 93% accuracy on that. Now, we have 75,000 unlabeled examples to help the model on the unsupervised side, but you don’t have to label those 75,000.
So data augmentation is a very cool way of trying to amplify a little bit of data. So this is one of the models with just standard attention, and we got an F score of about 0.82, and when we move to entropy…
—as far as synonymous with accuracy.
Yeah, yeah, yeah. For sure.
And the neat thing is that we have this confusion matrix down below, and one thing that I like to do is click on one of these squares, or hover over it, and I can see the actual… and we can see that this test data set is pretty small. We can see what it’s actually getting.
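For reference, here is how an F score relates to the four cells of a binary confusion matrix. The counts are made up and do not come from the demo. F1 and accuracy are related but not the same number, and they can differ noticeably when the classes are imbalanced.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for a small, imbalanced test set:
# 10 actual positives and 40 actual negatives.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=38)
```

With these counts the accuracy comes out higher than the F1 score, because the many correctly classified negatives raise accuracy but do not enter the F1 calculation.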
And we can also take a look at some samples of the things that it does get right and the things that it does get wrong. So if there’s a pattern in the things that the model is getting wrong, maybe that’s a tip that we need to go back to the labeling tab and relabel things, or maybe clean up the data a little bit. And then you can see you have a record of all of the hyperparameters that you used for this.
And one thing that I use quite often is this “copy”, where I can just copy all those hyperparameters, and it copies everything that I had down there so I can rerun the model again if I go, wow, was that just a weird coincidence, or can I repeat the experiment?
And once you have these models, we can actually export them into a prediction server, so the model is already trained. And if you have new examples coming in over the next week, month, or however long, you can send new examples to the prediction server and it will return a result, and you can classify and label your data that way as well.
Okay. Once you have your models built, we ensemble them together, and once they’re ensembled, you can label up all the unlabeled data, and that can then be used in another turn within Jaxon. So use that to train another model, or you can simply export that labeled data set, and you’re off and running.
And again, thank you, Brad. Thank you all for joining, and cheers.