Jaxon’s patent-approved SmartSplit Technology is a proprietary means of splitting a dataset (e.g. into training and holdout datasets) in a way that avoids covariate drift and other latent differences between those datasets.

Specifically, it aims to improve upon the standard baseline approach of random sampling given a pre-determined percentage split.

**Greg:**

**SmartSplit Webinar Intro**

So while I am getting the deck going, let me just say hello, I’m Greg, I’m CTO of Jaxon, and I’m gonna be the talker today.

I also have with me Carly, our director of marketing, who’s gonna help keep things moving and shut me up when I talk too much, make sure I don’t miss questions. We also have Brad Hatch with us, our principal data scientist, the primary driving force behind the SmartSplit technology that we’re gonna be showing off today.

**Train/Test Splits**

All right. So today we’re gonna talk about our SmartSplit technology, at least we are if the screen show wants to cooperate, and the theme of the conversation at a higher level, before we get into the gory details of SmartSplit, is training and testing splits.

So as we go through the process of training machine learning models and thinking about how we’re going to prepare our training data, one of the first questions that always comes up is “How am I gonna split off a holdout test set so that I can measure the models that I produce?”

Now, there’s a rule of thumb that most of us use most of the time without thinking too awfully hard about it, which is, let’s take our data, split off 20%, that’s the most common—sometimes we’ll have a big data set and choose to split off 10% because that’s way different—and we use a random split to do it.

And one of the first questions we have to ask is, you know, “Is just taking a rule of thumb like that always the right answer?” And of course, “always” should be a bit of a hair-raising word to anybody because there’s just about nothing that always works.

You know, some of the trade-offs: if you make your test set too big, you’ve wasted examples that you could have used in your training set that might have made it perform just a little bit better. There’s something in there you could have learned just a little bit more from.

Conversely, if you have too small a test set, it’s gonna really be a poor measurement of how well your model will actually perform once it rolls out into the field. You simply haven’t covered all the potential cases. And you really should think of any model metrics that you compute, things like your accuracy, your F1, confusion matrices, areas under curves and so forth, as single measurements out of a latent distribution you can’t see.

And one of the things you really have to think about is, you know, am I getting a measurement towards the middle of that distribution, near the mean, and how wide is that distribution? We’ll get into this a little bit more later. Or am I taking something that’s maybe not representative of what I’m more likely to see in the field? And that’s something that we worry about.

But to bring us back to the question of, before we even talk about how to split it—if we still assume a random sampling, how much should we split off?

Well, I think the answer, if you want to get a little more certain, can be inspired by survey statistics and the techniques from that field, wherein you have to establish a confidence bound.

You can’t have just a magic number that’s always right, but you can say that my answer will be within a certain degree of error that I’m willing to tolerate, within a particular confidence bound. Typical choices might be: I’m willing to accept a 5% error with a confidence of, say, 95%, so 95% of the time my answer will be within that error.

You can start to work out, for a given data set and a given model, how many examples do I need in my test set to be sure that I meet those confidence bounds, and it turns out that there’s one more variable it depends on, which is the predicted accuracy of your model.

And while I understand you don’t know that until you’ve trained it and measured it, generally speaking, you’re gonna have at least some idea of the neighborhood that it’s in. And intuitively, the more accurate your model is, the fewer examples you need to ensure it. Simply because, you know, if you have a model that is supposed to be 99.999% accurate, one wrong answer and you pretty much throw that out the window.

Whereas if your model is 50%, 50-50, you’re gonna have to measure an awful lot of examples before, statistically, you’re really sure that it is or is not at that level of accuracy.
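As a rough sketch of the survey-statistics idea above, the textbook normal-approximation formula n = z² · p · (1 − p) / e² estimates how many test examples are needed for a margin of error e at a given confidence, where p is your guess at the model’s accuracy. This is the standard formula, not necessarily the one Jaxon uses; the function name `required_test_size` and its parameters are made up for illustration, and the approximation becomes unreliable when p is very close to 0 or 1.

```python
import math
from statistics import NormalDist


def required_test_size(expected_accuracy, margin=0.05, confidence=0.95):
    """Approximate test-set size for a given margin of error and confidence.

    Textbook normal-approximation formula n = z^2 * p * (1 - p) / e^2.
    Assumes a simple random sample drawn from a much larger population.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = expected_accuracy
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


if __name__ == "__main__":
    for acc in (0.5, 0.7, 0.9, 0.99):
        size = required_test_size(acc)
        print(f"expected accuracy {acc:.2f}: about {size} test examples")
```

With the defaults (5% error, 95% confidence), a 50-50 model needs about 385 test examples, while a model expected to be 90% accurate needs about 139, which matches the intuition that a more accurate model needs fewer examples to confirm.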

**Is that all that can go wrong?**

Alright. So we’ve talked about how to split off your data and how big a test set needs to be, and that’s the theme of today. But that sort of under/over splitting is not the only thing that can go wrong when it comes to splitting off your data and making sure that your model evaluations are accurate.

You know, other things happen, data changes. There’s all sorts of drift, the notion of the environments of your data, which is sort of a subtle way to describe the nature of your data. What are the examples like? Are they changing over time? Are they changing with context, because you measure it in different systems? Are the predicted values changing?

Sometimes it might be the master set or range of values you’re predicting that changes, and sometimes it may simply be the distribution. You’re just naturally starting to see a few more dogs than cats, and maybe that’s because of some sort of drift error, maybe that’s because there actually are more dogs than cats in the particular pile of images you happen to be looking at, or maybe just dogs are outperforming cats in the world because we all know dogs are better.

And conversely, if you have a very, very small sample, you can get bias, in particular in your test sets from under-sampling, which is also true of training sets.

**Concept Drift**

So one of the notions of the way that data may change over time is the idea of concept drift. I’m gonna tell a little story, and I am blatantly ripping this off from a paper by, if memory serves, the folks over at Google, called “Invariant risk minimization”. And this is something where you can Google that phrase and the paper will come right up.

But the key notion was that some researchers may create an image classifier, and they want to separate out cows from camels, bring up a great classifier that’s pretty accurate, roll out in the field, and it starts misclassifying all over the place as soon as they start taking live pictures. The question becomes, why?

Well, it turned out in this particular case that the training data they had had a bias in it that nobody thought of, and no matter what, there are going to be hidden biases in the training data, and you’re not going to manually be able to think of them all or remove them all.

In this particular case, it turned out that all the pictures of camels that they had to train from were taken in the desert on a brown background, and all the pictures of cows were in pastures with a green background. And so as soon as the system ran across a picture of a cow standing on dirt or a camel standing on grass, it would misclassify them. What they had actually trained was a background classifier.

So one of the primary ways to deal with this causal drift (and it is a question of causality: did we classify this as a camel because there was a camel, or because there was a brown background?) is this invariant risk minimization technique.

The idea here is to introduce an intermediate stage to your data. Instead of predicting your outputs, your Ys, your cow versus camel directly from the X, you find what we call an invariant representation. So what you’re actually searching for is some function, W, such that when you transform X by W, you can still make the predictions you want, and it stays invariant even as things change.

And this is just a little aside on a different kind of drift than what we’ll be talking about primarily with SmartSplit, but I wanted to bring it up just to make sure that we’re covering all the ground in terms of different types of data drift. And if you want to learn more about this technique, again, feel free to go to Google and look up that particular paper under the keywords “invariant risk minimization”.

**Probability Drift**

So another type of drift that we think about is the notion of probability drift. Here, we are looking at the same X, the same Y over time, we still have pictures of 1000 camels, and we still have cows and camels, we haven’t introduced chickens into the picture, but the distribution is changing. Cows are getting more popular, camels are getting less popular.

Another example of this would be trying to model housing prices in a particular town over time. You can build a model, you can build a very accurate and effective model over housing prices in 1950. Fast forward to today, 70 years later, that model isn’t going to help you at all. It’s gonna be very, very useless.

The houses may still be the same, some of the houses in Santa Clara haven’t changed much since the 50s. The same is true in Boston and many areas of the country, but the prices have gone up, and they haven’t necessarily gone up proportionately because there are other latent factors like the land, their proximity to, you know, this and that in the neighborhood and so on.

Now this is something where, you know, time is the primary dimension in which things are going to drift, given that the data hasn’t changed. So we’re looking at something in the real-world environment that is changing over time. And the primary way to deal with this is to simply monitor and adjust. You know, there’s almost no such thing as a model that you can create, roll into production, and expect to stay as accurate as it has been for all time.

These are living things, so to speak, and they need to be continually monitored, continually fed with new, updated examples, perhaps sampled out of the actual live data stream, and adjusted just as these biases start to change over time. And eventually, the model needs to be updated or just simply replaced with newer models trained on newer and newer weighted data.
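One very simple way to picture the monitoring described above is to compare the label mix in a reference window against a recent window. The sketch below uses total variation distance between the two label distributions; the function names and the 0.1 threshold are arbitrary choices for illustration, not something from the talk, and both lists are assumed to be non-empty.

```python
from collections import Counter


def label_shift(reference_labels, recent_labels):
    """Total variation distance between two label distributions.

    Returns 0.0 when the class proportions match and 1.0 when the two
    windows share no classes at all. Both inputs must be non-empty.
    """
    ref, new = Counter(reference_labels), Counter(recent_labels)
    n_ref, n_new = len(reference_labels), len(recent_labels)
    classes = set(ref) | set(new)
    return 0.5 * sum(abs(ref[c] / n_ref - new[c] / n_new) for c in classes)


def needs_attention(reference_labels, recent_labels, threshold=0.1):
    """Flag the model for review once the label mix moves past the threshold."""
    return label_shift(reference_labels, recent_labels) > threshold


if __name__ == "__main__":
    reference = ["cow"] * 50 + ["camel"] * 50
    recent = ["cow"] * 70 + ["camel"] * 30  # cows are getting more popular
    print(round(label_shift(reference, recent), 3))
    print(needs_attention(reference, recent))
```

In practice you would run a check like this on a schedule against labeled samples from the live stream and retrain or replace the model when it fires.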

**Covariate Drift**

Alright, so this takes us to covariate drift, and now we get into the area of the world that Jaxon SmartSplit technology is really designed to address.

So in the case of covariate drift, the probability of Y, the distribution as it were of whether we’re a cow or camel, or what the median house price is, isn’t going to change, but the probability of what we’re going to see in the actual data does. So if you think about cases like this, it’s rampant in social media, as jargon and lingo and slang change just as the memes change.

But just to give a bit of a more timeless example, so to speak, you know, we don’t use the word “thou” after about the 17th century. So if you want to try to figure out if Shakespeare could have written this particular passage, does it have the word thou in it? Then maybe. Does it not have that? Well, he probably would have used it, given that he was doing his thing before the 17th century.

And this sort of covariate drift, where the distribution of the latent features in X is changing over time but the distribution of our output does not change over time (or if it does change over time, we use these other techniques), is really the core of where the SmartSplit technology helps us the most.
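For text, one crude signal of this kind of covariate drift is the out-of-vocabulary rate: the share of tokens in new data that never appeared in the reference data. The sketch below is only an illustration of that idea with a deliberately naive whitespace tokenizer; the function names are hypothetical and it is not part of SmartSplit.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def oov_rate(reference_texts, new_texts):
    """Fraction of tokens in new_texts that never appear in reference_texts."""
    vocabulary = {tok for text in reference_texts for tok in tokenize(text)}
    new_tokens = [tok for text in new_texts for tok in tokenize(text)]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocabulary)
    return unseen / len(new_tokens)


if __name__ == "__main__":
    older = ["thou art wise", "thou shalt not"]
    newer = ["you are wise"]
    # "you" and "are" are unseen in the older texts, "wise" is not.
    print(oov_rate(older, newer))
```

A rising out-of-vocabulary rate between the data a model was trained on and the data it now sees suggests the inputs have moved, even if the labels have not.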

**SmartSplit**

So SmartSplit, at a high level, involves three steps. And again, just to frame this, our stated task here is that we have a data set that we want to divide up by some proportion.

Hopefully we’ve used survey statistics to make a wise decision in terms of what percentage of those examples we want in the training set and, for example, what percentage in the testing set, and if we’re doing a third validation set, the same holds.

But here we want to actually figure out, okay, now which specific examples from the overall dataset should I put in my training set and in my test set? And you know, can I do better than random sampling?

So you know, the random sampling, the reason it fails, and we’ll talk a little bit about the situation where you might actually still prefer random sampling, but by and large, the reason random sampling will fail is because all the examples are not created equal. Some of them will be duplicates, some of them will expose more than one latent feature that the system might be able to find signal in, and some of them aren’t that useful at all.

We want to make sure that we are very choosy about exactly which representations go into which bucket such that they are as alike as absolutely possible, even in the dimensions that we haven’t been able to see as data scientists and analysts eyeballing the data.

So the three high level steps of SmartSplit are to first create a representation. We call that W(X). This could be something inspired by the invariant risk minimization we talked about; doesn’t have to be, you know, I don’t want to confound these two areas of research and technology, although there is an opportunity to marry things up.

Once we have projected our dataset X into this representation space W, we’re gonna create a strategic clustering chain over that W, and we’re going to further subdivide those examples into a lot of very strategically crafted clusters, each of which starts to get at a particular topical area or latent feature area/sub area of X.

And then once we’ve done that, now when we do our split, we’re going to sample those clusters, and we’re gonna sample them such that everything comes back out with the expected distribution in Y. So for each cluster, if we’ve said we want an 80/20 split, we’re gonna take four examples per cluster, put them into our training set, take the final example, and put it into our test set, and keep doing that proportionally.

And we have a handful of clever tiebreaking mechanisms when we have uneven numbers in a given cluster or very large or very small clusters, and we could get into the clustering chain and so forth, but that’s getting deeper into the weeds than we’ll have time to address today.
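The sampling step described above can be sketched as a cluster-stratified split. This is not Jaxon’s proprietary implementation: it assumes the representation and clustering have already been done elsewhere and that each example simply arrives with a cluster id, and it uses a plain largest-remainder rule as a stand-in for the tiebreaking mechanisms mentioned. The function name and parameters are hypothetical, and `test_fraction` is assumed to be strictly between 0 and 1.

```python
import math
import random
from collections import defaultdict


def cluster_stratified_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split example indices so each cluster gives about test_fraction to test.

    cluster_ids[i] is the cluster of example i (assumed to come from some
    clustering over a learned representation, which is not shown here).
    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    # Ideal fractional test quota per cluster, rounded down to start with.
    quotas = {c: len(idx) * test_fraction for c, idx in members.items()}
    counts = {c: math.floor(q) for c, q in quotas.items()}

    # Hand the remaining test slots to the clusters with the largest
    # remainders, so uneven cluster sizes still add up to the overall target.
    leftover = round(len(cluster_ids) * test_fraction) - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for cluster in by_remainder[:max(leftover, 0)]:
        counts[cluster] += 1

    train, test = [], []
    for cluster, idx in members.items():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        test.extend(shuffled[:counts[cluster]])
        train.extend(shuffled[counts[cluster]:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    ids = [0] * 10 + [1] * 5 + [2] * 5
    train, test = cluster_stratified_split(ids, test_fraction=0.2)
    print(len(train), len(test))
```

With an 80/20 target, a cluster of five examples contributes four to training and one to test, as in the description above; clusters whose sizes do not divide evenly are handled by the remainder rule.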

So that’s the technique of, at a high level, how SmartSplit works. So now let’s dig in a little deeper and see what the results are.

**When Not?**

Before we talk about all the times you want to use SmartSplit, I promised that we would talk about when random sampling is still a better choice, and when you want to use it. And in our experience, if you have a very, very large data set, SmartSplit won’t help you so much anymore. You’ve got plenty of data to go around. And the law of large numbers means you’re going to average out between your training and your test set.

What does very large mean? Just empirically speaking, if you were doing a multi class classification problem, maybe something like 300 labeled examples in your smallest class might be a good rule of thumb. Certainly what that means overall will depend on how much skew you have: if your classes are pretty balanced and you’re doing a binary problem, then, you know, it’s a smaller dataset.

You might have a heavily skewed situation where you’re working against a very, very long tail class, in which case you’d need a tremendous amount of data before it would make sense to do this.

Does SmartSplit hurt? No, SmartSplit is going to do about as well as random sampling even in this case. It’s just an opportunity for optimization, since you can randomly sample presumably faster algorithmically than some of the extra machinery that comes along with SmartSplit.

But other than the situation where SmartSplit takes a little more computation and a little bit more time, we haven’t really found a case yet where you shouldn’t use it. All right.

**Why: Test Variance**

So there are two primary benefits of using SmartSplit. And the first one, and the reason we started looking in this direction, was the notion of test variance, which I touched on earlier. So, when you take one test set that you split off from the training data to evaluate a model, you’re getting a single measurement of that model’s theoretical distribution over all the collections, all the documents out there in the wild that it might have to predict in production.

There’s actually a bit of a distribution, think about a bell curve or Gaussian distribution, of what your model’s actual accuracy is, and you’re getting to sample a single measurement. That test set that you use is one observation against what your model’s actual performance distribution is.

So, naturally, if we can squeeze that bell curve, make it narrower or, more statistically, if we can reduce the variance of that curve, then we can ensure that the measurement we’ve made is more likely to be closer to that model’s true performance out in the wild.

The initial thesis here was, “How can we divide things up such that the measurement we make on a model is more likely to reflect what we’re going to see in production with that model, where we’re seeing brand new examples we’ve never seen, rather than the sort of traditional disappointment where something that looks great in the lab immediately falls off in terms of performance when it hits the production realm?”

So we had some very encouraging success on this. I have two graphs here of a couple of different benchmarks we ran. Each of these represents a different dataset and modeling problem. And we ran the same problem, with random splits on the left side and SmartSplit on the right side, you know, 100 different times, so we could get a proper distribution and see how the same modeling architecture performs as we give it different permutations of the data to use for training and testing.

And if you look, you’ll notice that in both cases, the SmartSplit distribution on the right side is rather narrower than the random split on the left side. Empirically, when we add up the numbers, we seem to be getting about a three times reduction in variance when we use SmartSplit. And that seems to hold up fairly well across different data sets, and we consider that to be generally a win.
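To get a feel for why sampling by cluster can narrow the measurement distribution, here is a small synthetic simulation. It does not reproduce the benchmarks in the talk: it invents a fixed model that is right 95% of the time on one cluster and 25% on another, then repeatedly draws a 200-example test set either at random or stratified by cluster, and compares the variance of the measured accuracy. All names and numbers are made up for illustration.

```python
import random
import statistics


def measured_accuracy(correct, test_indices):
    """Accuracy of a fixed model on one particular test set."""
    return sum(correct[i] for i in test_indices) / len(test_indices)


def simulate(trials=2000, seed=0):
    """Return (variance with random test sets, variance with stratified ones)."""
    rng = random.Random(seed)
    cluster_a = list(range(0, 500))     # model is right on 95% of these
    cluster_b = list(range(500, 1000))  # model is right on 25% of these
    correct = [1] * 475 + [0] * 25 + [1] * 125 + [0] * 375
    everything = cluster_a + cluster_b

    random_scores, stratified_scores = [], []
    for _ in range(trials):
        # Plain random 20% test set.
        random_scores.append(measured_accuracy(correct, rng.sample(everything, 200)))
        # Same size, but 20% drawn from each cluster separately.
        stratified = rng.sample(cluster_a, 100) + rng.sample(cluster_b, 100)
        stratified_scores.append(measured_accuracy(correct, stratified))
    return statistics.pvariance(random_scores), statistics.pvariance(stratified_scores)


if __name__ == "__main__":
    random_var, stratified_var = simulate()
    print(f"random split variance:     {random_var:.6f}")
    print(f"stratified split variance: {stratified_var:.6f}")
```

The stratified draws remove the luck of how many hard or easy examples land in the test set, which is the part of the variance that a cluster-aware split can squeeze out; how large the reduction is on real data depends on how well the clusters capture differences in model behavior.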

But wait, there’s more.

**Why: Model Performance**

So it turns out that the models are better too. If you notice back here, not only is the distribution band much smaller for the SmartSplit experiments on the right, but the mean, the median, and the average output of the model which is represented by the red lines here, are also higher. So we’re actually not only getting models that are more robust and our predictions more robust to how they’re going to work out, but it turns out that the models are better.

We’re getting an average of 15% reduction in model error, which we hypothesize has to do with the fact that the model is seeing accurate proportions of the different types of your latent information, latent feature space that is available in the training data.

And it turns out, as if that weren’t enough, that there’s a bonus that things seem to converge just a little bit faster as well.

So we’re finding that our models are achieving convergence faster, maybe again because they aren’t having to stumble across redundant or improperly distributed and skewed examples.

**SmartSplit Webinar Outro**

Alright. With that, I am going to pause, take a sip of my coffee, since it’s still a little early out here on the west coast, and I think we can open the floor up to questions if there are any.

Alright. Seems like I’ve either put everybody to sleep or everybody’s speechless. So with that, I think we are at the end of the webinar. So I’d like to thank everybody for joining us and for listening in. I appreciate everyone for attending.

**Carly:**

If you have any questions, feel free to reach out to us, either via our website www.jaxon.ai or at our email address info@jaxon.ai, and we’d be happy to answer any questions that you have there too.

Alright, thanks everyone. Bye.