Synthetic Data: Noise vs. Quantity

Synthetic data is artificially created from real datasets using generative AI, both to expand them in general and to acquire more long-tail data points. This lets you augment your machine learning models without being restricted by data collection. However, adding more synthetic data also adds more noise, which can degrade model quality.

Greg Harman, CTO of Jaxon, shares his thoughts on how to balance the tradeoffs between noise and quantity.



So I’d like to talk about synthetic data, the inherent trade-off between the noise and the quantity of that synthetic data, and how you can use this trade-off to improve your first-cut machine learning models.

Training a Model

So let’s say that you started training a model with the data you had available, and you ended up with a loss curve that looked something like this.

That is, the training loss goes down and minimizes as you expect, and the test set loss does the same thing, only it levels out with a substantial gap between the two loss values. The Y axis here is loss, and the X axis is training iterations. The question becomes: how can we improve this? Obviously, if you had more data available, that would be the way to improve it.
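To make that concrete, here is a minimal sketch of measuring the train/test gap from two loss curves. The curves and the `generalization_gap` helper are hypothetical illustrations, not from the talk; the shapes mimic the chart described above, where both curves flatten but the test loss plateaus higher.

```python
import numpy as np

def generalization_gap(train_loss, test_loss, tail=10):
    """Estimate the train/test gap from the tail (plateau) of two loss curves."""
    return float(np.mean(test_loss[-tail:]) - np.mean(train_loss[-tail:]))

# Hypothetical curves: both decay and level out, test loss plateaus higher.
iters = np.arange(100)
train = 0.1 + 0.9 * np.exp(-iters / 15)
test = 0.4 + 0.6 * np.exp(-iters / 15)

gap = generalization_gap(train, test)  # roughly 0.3 for these curves
```

A large, persistent gap like this is the signal that more (or better) data, or more regularization, is needed.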

Adding Synthetic Data

But what if I don’t have more labeled data? Then I start thinking about how I could add synthetic data.

Now, there are a couple of different approaches one can take to add synthetic data. And I’m not gonna go deeply into those, but as a couple of examples…

I could add pseudo labels, or synthetic labels, which has the effect of predicting a label Y for examples from my unlabeled X values. If you happen to have a lot of unlabeled data left over, and just didn’t have the budget or the time to hand-label it, you might take an approach like pseudo labeling. Many semi-supervised methods rely on this approach.
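A common variant of pseudo labeling keeps only the examples the current model labels confidently. This is a sketch under assumptions: the `pseudo_label` helper, the confidence threshold of 0.9, and the toy one-dimensional "model" are all illustrative, not part of the talk.

```python
import numpy as np

def pseudo_label(model_proba, unlabeled_X, threshold=0.9):
    """Pseudo-label unlabeled data, keeping only confident predictions.

    model_proba: callable returning class probabilities, shape (n, k).
    Returns (X_new, y_new) to append to the labeled training set.
    """
    proba = model_proba(unlabeled_X)
    confidence = proba.max(axis=1)
    keep = confidence >= threshold
    return unlabeled_X[keep], proba[keep].argmax(axis=1)

# Toy 1-D model: class 1 if x > 0, confidence grows with |x|.
def toy_proba(X):
    p1 = 1.0 / (1.0 + np.exp(-5.0 * X[:, 0]))
    return np.column_stack([1.0 - p1, p1])

X_unlabeled = np.array([[-2.0], [-0.1], [0.05], [3.0]])
X_new, y_new = pseudo_label(toy_proba, X_unlabeled)
# Only the two far-from-boundary points survive the threshold.
```

Raising the threshold trades quantity for noise directly: you keep fewer pseudo-labeled examples, but the labels you keep are less likely to be wrong.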

The other approaches you might think of are generative. Things like GPT-3 or GANs may come into play here. In this case, the techniques tend to work in the opposite direction: given a Y, given a label, can I generate a brand-new synthetic X?
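The "given a label, generate a new X" direction can be illustrated with a much simpler generative model than GPT-3 or a GAN: a per-class Gaussian fit to the real data. The helpers and the two-class toy dataset below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_class_conditional(X, y):
    """Fit a per-class Gaussian (mean, std) as a minimal generative model."""
    return {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0))
            for c in np.unique(y)}

def generate(params, label, n):
    """Given a label Y, sample n brand-new synthetic X values for that class."""
    mu, sigma = params[label]
    return rng.normal(mu, sigma, size=(n, mu.shape[0]))

# Two well-separated 2-D classes: class 0 near the origin, class 1 near (5, 5).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_class_conditional(X, y)
X_synthetic = generate(params, label=1, n=10)
```

The synthetic points land near the class-1 cluster, but any mismatch between the fitted model and the true distribution shows up as exactly the label noise discussed next.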

Just a couple more examples: as opposed to strictly generative labeling, you could also do things like unsupervised data augmentation or algorithmic data augmentation, depending on what your domain is. These are all means of acquiring more data. But what’s going to happen in your loss chart? The synthetic examples you generate are probably not perfect, so you’re going to introduce more noise than your original data set had.
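Algorithmic augmentation can be as simple as perturbing each real example slightly while keeping its label. The `jitter_augment` helper and its parameters below are a hypothetical sketch; for images or text you would substitute domain-appropriate transforms.

```python
import numpy as np

rng = np.random.default_rng(1)

def jitter_augment(X, y, copies=3, scale=0.05):
    """Algorithmic augmentation: add small Gaussian noise to each example.

    Returns the original data plus `copies` perturbed versions of it,
    with labels carried over unchanged.
    """
    X_aug = np.vstack([X + rng.normal(0, scale, X.shape) for _ in range(copies)])
    y_aug = np.tile(y, copies)
    return np.vstack([X, X_aug]), np.concatenate([y, y_aug])

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])
X_big, y_big = jitter_augment(X, y)  # 2 originals + 6 jittered copies
```

If `scale` is too large the perturbed points cross class boundaries, which is one concrete way augmentation injects the noise this section warns about.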


And so what you’re going to end up with now is a loss chart where training proceeds as normal. My training loss looks good like before, and I might have improved what my test set does: its loss starts out decreasing like the training loss, but at some point it turns back up, and I start to overfit.

Now, I may have made some improvements over where we were before. But I want to bring that test loss down and account for the noise, because this gap is due to the extra noise I’ve introduced with my pseudo labeling or generative labeling.


And so, across all these different techniques, how do we fix this kind of chart? We regularize. And you know, in the end, a lot of deep learning really is just: apply enough data, then regularize, assuming our model is large enough to potentially model the actual situation. We just need to bring the test loss down and keep the model from overfitting. There are a whole lot of ways to regularize, including things like L2 weight decay and dropout. When you regularize, you reduce the overfitting and bring this chart back toward something that looks a little more like our ideal chart.
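Of the regularizers just mentioned, L2 weight decay is easy to show end to end. This is a minimal sketch on linear regression rather than a deep network, with hypothetical data and hyperparameters: the penalty appears as the extra `weight_decay * w` term in the gradient.

```python
import numpy as np

def train_ridge_gd(X, y, weight_decay=0.0, lr=0.01, steps=2000):
    """Gradient descent on squared error with an L2 (weight decay) penalty.

    The `weight_decay * w` term in the gradient continually shrinks the
    weights toward zero, which is the regularizing effect.
    """
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + weight_decay * w
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

w_plain = train_ridge_gd(X, y, weight_decay=0.0)
w_reg = train_ridge_gd(X, y, weight_decay=1.0)
# The regularized weight vector has smaller norm than the unregularized one.
```

The same idea carries over to deep models, where it is usually exposed as a `weight_decay` setting on the optimizer; dropout, by contrast, regularizes by randomly zeroing activations during training.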

Now, it’s always possible that, even so, you’ll start to get a little divergence near the end. But that’s okay, because we have early stopping, which, it has been argued, is in fact just another form of regularization, in particular if you start viewing the duration of your training as itself a hyperparameter.
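Early stopping with patience is simple enough to sketch directly. The `early_stop_index` helper and the hypothetical validation curve below are illustrative, not from the talk; the curve decreases, then diverges, mimicking the overfitting chart described earlier.

```python
import numpy as np

def early_stop_index(val_loss, patience=5):
    """Return the iteration with the best validation loss, stopping once
    `patience` consecutive iterations pass without improvement."""
    best, best_i, bad = np.inf, 0, 0
    for i, v in enumerate(val_loss):
        if v < best:
            best, best_i, bad = v, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_i

# Hypothetical validation curve: decreases, then turns back up (overfitting).
val = [1.0, 0.8, 0.6, 0.5, 0.45, 0.44, 0.46, 0.5, 0.55, 0.6, 0.7, 0.8]
stop = early_stop_index(val)  # index of the minimum, before the divergence
```

Treating the stopping point as a hyperparameter, as the talk suggests, means `patience` itself becomes something you can tune alongside the other regularizers.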


So there you have it. That’s one way to think about the quantity of your training data versus the noisiness of that data, and how you can actually use this trade-off to improve your modeling as you iteratively continue to refine your model and your training process.