Data Augmentation Webinar

Data augmentation is the process of creating new, different data points by slightly altering old ones. It’s a powerful data science tool that helps expand datasets and cover edge cases.

In this talk, we cover what data augmentation is, how it works for images and text, and why text augmentation is so much more difficult than image augmentation.


Scott

All right, we’ve hit the noon mark, at least here in the Eastern time zone. Welcome everyone. Welcome to the Jaxon Show. I’m your host, Scott Cohen, founder-CEO. Greg Harmon, here with me, co-founder-CTO, and Brad Hatch, our principal data scientist, who will take the floor momentarily and run us through data augmentation.

Just wanted to start out by sharing that our goal as a company is to eliminate as much of the human, or at least the rote manual, work around data labeling and creating training data for machine learning models as possible. We’ve built automation into a number of the steps, from active learning to concepts around weak supervision and generative modeling to what we’re gonna highlight today: data augmentation.

When we talk about data augmentation, we’re trying to add capabilities that both help the model generalize and help the training set cover the edge cases, the manifolds of the neural net. With that, let me turn it over to our expert. Brad?

Brad

Alrighty. So the general progression of this, it’s not really a story arc, is: what is data augmentation? Why do we need it? What does it look like for images, and why is that easy? And what does it look like for text, and why is that hard?

So, starting out with: what is data augmentation? The main point behind this slide is that deep learning needs a lot of data, and sometimes what we have available is not enough, even if it’s a million examples.

And so what data augmentation does is create a larger and more diverse data set by generating extra examples based on what you already have. This can be done through two general methods. One is just: somebody writes a function that takes an example and alters it a little bit, like rotating the image slightly, and that creates a new example for the model to learn from.

The second way is to have a second model learn all the characteristics of the data you have and then try to mimic it by generating similar examples. As you can imagine, the latter is a much more difficult task than just writing a quick function to take one example and turn it into another.

Greg

Yeah. In essence, the second way is GANs, versus the first way, which is just, like, Photoshop techniques, right?

Brad

Yeah, yeah. PyTorch comes with a slew of examples of the first kind.

Um, so first, let’s talk about image augmentation. These are just simple functions that take one example, like this picture of a cat here; that’s the original image. We can create new examples by just altering that original image a little bit. The first one is a rotation. The second one, if you notice, we flipped it, so the tail is now on the other side. The third one is grayscale. We can also enlarge it and then crop it so it’s still the same size.

And then finally, at the end, we can adjust the color hue, the brightness, and the contrast. Each of these is a brand-new example to the model; the model has never seen these examples before, because certain things are in different places or they’re different colors.
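For readers who want to try these, here is a minimal sketch of those single-image augmentations using torchvision’s built-in transforms ("cat.jpg" is a hypothetical placeholder; any RGB image works):

```python
from PIL import Image
from torchvision import transforms

img = Image.open("cat.jpg")  # hypothetical example image

rotated   = transforms.RandomRotation(degrees=15)(img)              # slight rotation
flipped   = transforms.RandomHorizontalFlip(p=1.0)(img)             # mirror: the tail switches sides
grayscale = transforms.Grayscale(num_output_channels=3)(img)        # drop the color
cropped   = transforms.RandomResizedCrop(size=img.size[::-1])(img)  # enlarge, then crop back to size
jittered  = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                   hue=0.1)(img)                    # hue / brightness / contrast
```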

Scott

All right, I’m gonna cut the flow now. I’m curious.

So, in Photoshop, which I’ve played around with, I know there are so many different options. How do you decide which ones to apply, and how many? Like, if there’s only a small variance on, say, hue, are you gonna do every notch of the settings within Photoshop, or are you gonna do, like, every ten?

You know what I mean?

Brad

Yeah. Well, there are some you probably want to steer away from. Like, notice the second one was flipped on a vertical axis, so what was on the left is now on the right. You probably don’t wanna flip it on the horizontal axis, because that’s not indicative of a real-world example; you’d have an upside-down cat.

Scott

Sure.

Brad

And usually there’s a standard set of augmentations, and PyTorch and TensorFlow have those already programmed in, so you don’t need to use Photoshop; this happens during training.

Scott

That’s for like, say, twisting the image, right?

Brad

Yeah.

Scott

Let me ask this question. I guess you have 360, or, like, 180 degrees to turn the image. Do you do a click on every degree?

Brad

No, it’ll randomly select a degree between, like, 0 and 45.

Scott

Zero through 45? It wouldn’t be, like, three or four in that range? It’s chunked up into 45-degree segments?

Brad

So every time this original image is presented to the model, you can do a random transformation to it.

Scott

Random.

Brad

Yep. Random. So you randomly select to rotate it, and then you randomly select the rotation, anywhere between 1 and 45 degrees, or 90, or 180, somewhere in there. So there’s a huge number of different combinations of these augmentations, and you can combine them as well.

Scott

Right.

So that answers my question about hue. You wouldn’t do every click; you’d do two or three per image.

Brad

Yeah, exactly.

Scott

You would do each option separately. Like, you’d do two or three variants on contrast, two or three variants on saturation, that kind of stuff.

Brad

Yeah. So what you’d do is, if you had a list of 10 different ways to augment this image, like hue, rotate, flip, you could randomly select three. And then within those three, there’s usually a range that you can randomly select from.
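That "pick a few transforms, then let each sample its own parameter from a range" policy looks roughly like the sketch below; the pool of transforms is an arbitrary illustration, not a canonical list:

```python
import random
from torchvision import transforms

# Hypothetical pool of candidate augmentations; each one samples its own
# parameter (angle, hue shift, crop box, ...) from a range when applied.
POOL = [
    transforms.RandomRotation(degrees=45),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(hue=0.1),
    transforms.ColorJitter(contrast=0.4),
    transforms.RandomGrayscale(p=0.5),
    transforms.RandomResizedCrop(size=224),
]

def augment(img, k=3):
    """Randomly pick k augmentations from the pool and apply them in sequence."""
    return transforms.Compose(random.sample(POOL, k))(img)
```

torchvision also ships a built-in policy along these lines, transforms.RandAugment(num_ops=3), which picks the operations for you.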

Scott

This is through PyTorch?

Brad

Yeah. Yeah. And TensorFlow, and any deep learning library.

Scott

Right. But presumably the sky’s the limit if we wanted to do this outside of PyTorch.

Brad

That’s right. Yep.

Scott

Okay. Um, I’m curious now about some of the other features in Photoshop, like shadow. Is that something that is done?

Brad

Um, usually no. Because... you mean add shadows to, like, the cat?

Scott

Yeah.

Brad

Like have the cat have shadows?

Scott

Yeah. Yeah.

Brad

Uh, it depends; a lot of it is determined by computation time.

If it doesn’t take very much time to add a shadow, then maybe. But if you’re just looking for cats in an image, then a shadow might not help you find the cat.

Scott

True. That’s true. No. Okay.

Brad

Great question, though. So the pros of these types of image augmentations: they’re simple, they’re fast, and they give the model a lot of variety. One of the cons is that it’s limited by what’s in the data set; it can’t imagine stuff that’s not in the data set.

Scott

Got it.

Brad

Uh, but recent strides in teaching a model how to generate data have come a long way. The generated pictures used to just be blobs with eyeballs. But this is from a paper that actually just came out earlier this month. It was trained on a celebrity-faces data set, and these are generated images. They’re very crisp, and the faces are very symmetric, which was also a problem in the early days.

Um, but you can see that the model can learn the distribution, the different features, and the characteristics of the data in order to generate its own. And these data samples exist nowhere; it just dreamt them up and created them.

Uh, so this is from a model that came out earlier this month. There’s also a website... can you see that on my screen?

Scott

Yeah. This person does not exist.

Brad

Yeah, yeah. Thispersondoesnotexist.com. And every time you refresh the page, it generates another image of a person that does not exist in the world, but the details are very, very fine.

Scott

Yeah. That’s cool.

Brad

And sometimes you get these artifacts. Sometimes you’ll see ’em in the ears, sometimes in the glasses or on the teeth. But actually, these are not too bad. And the background is usually just fuzzy. Oh, there’s kind of a weird earlobe. Actually, these are pretty good.

But this is just a website. Every time you refresh the page, it creates another—

Scott

Earring! She had two different earrings.

Brad

Yeah. And this one kind of has an earring. But it even gets the older features, and even the lighting. I think this is a really cool website.

Scott

Yeah, that is pretty cool.

Brad

Uh, the pros of this are, um, really—

Scott

Just outta curiosity, are they taking, like, actual features from other people and just like merging them together?

Brad

Nope.

Scott

It’s completely from scratch?

Brad

Completely from scratch. So you give it random numbers, and it turns those random numbers into a face.

And every time you give it new random numbers, even if you just alter one of those numbers—
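As a sketch of that "random numbers in, face out" idea: `generator` below stands in for any pretrained GAN generator (the loading code and the latent size are assumptions, not a specific model’s API):

```python
import torch

latent_dim = 512                # typical for StyleGAN-family models (assumption)
z = torch.randn(1, latent_dim)  # "the random numbers": a draw from N(0, I)
face = generator(z)             # hypothetical pretrained generator -> image tensor

# Nudge a single coordinate and the generator produces a slightly different face.
z2 = z.clone()
z2[0, 0] += 0.5
face2 = generator(z2)
```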

Scott

What’s your random number?

Brad

Oh, that’s a good question. That’d be kind of cool to see what my random numbers are, to generate someone as close as possible to me.

Scott

Yeah.

Victor

So how is it able to, I mean, you said these random numbers… how is it able to generate, like, a nose that isn’t, I don’t know, huge, for example, or avoid generating a third ear, stuff like that?

Like…

Brad

Yeah. So that kind of gets into the mechanics of what it’s trying to do. It depends on the model, but the architecture of this particular model is such that you just feed it a bunch of pictures of celebrity faces.

Victor

Mm-hmm.

Brad

And it can learn what features go together. And again, in the early stages of these things (and the early stages were, like, four or five years ago), they were just blobs. Sometimes the people had five eyeballs.

Uh, but just through the mechanics of the architecture, they’ve set up the model to really be able to extract what makes up a face.

Victor

I see.

Brad

But it learns it all on its own. That’s the cool thing about deep learning: you give it the data, and it will learn by itself what is important. You don’t have to tell it.

Victor

Yeah. So the pros of this… it’s, like, a cool thing to show your friends?

Brad

To generate images.

Brad

Because it just generates random images. Uh, but it can also nicely warm up your model before you try to do a supervised task, so this is also a form of pre-training. But to be quite frank, plain image generation is really not that useful. What you want, which is in the next slide here, is to be able to generate an image according to a class that is in your data. That is much more useful for training. So this comes from a recent paper, but as you can see, you can tell it, “Generate a picture of a fireplace,” and it will generate images of a fireplace that are beyond what’s in the data. Now, the watches get kind of funky…

And so when we’re generating images according to a certain class or a certain label that we have in our data set, that becomes much more valuable to us for a supervised training task. So if we’re trying to classify images and we don’t have a lot of pictures of fireplaces, we can generate those images.

And again, one of the biggest pros of this is generating images that are beyond what’s in the data set. You can apply all the functional augmentations (the rotating, the resizing and cropping) to these generated images as well. And typically, we want to generate images when we don’t have a lot in our data set, like for rare cases in medical imaging. If there’s a rare disease that shows up on microscope slide images, then we probably want to generate things that look like that, depending on what is normal and what is abnormal tissue.
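A sketch of that class-conditional idea: `cond_generator` and the class ids below are hypothetical stand-ins for a conditional generative model, not the paper’s actual code:

```python
import torch

latent_dim = 128
z = torch.randn(16, latent_dim)              # 16 random latent vectors
labels = torch.full((16,), 3)                # pretend class id 3 means "fireplace"
fake_fireplaces = cond_generator(z, labels)  # hypothetical conditional generator:
                                             # 16 brand-new "fireplace" training images
```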

Brad

So far we’ve only talked about how to augment images, and quite frankly, augmenting images is a much easier task than augmenting text. The reason behind that is in the data structures of each of those types of data. An image, if we look at the cat, is made up of pixel values, usually red, green, and blue pixel values.

And we have these continuous values, or at least we treat ’em as continuous. So if we added one to every single pixel value in this cat image, it probably wouldn’t change the structure of that image very much. Or if we changed just one random pixel, that would be imperceptible; neither human nor machine would be able to tell the difference.

But when we move over to text, text is what we call discrete data; every word is its own category. So if we have this sentence, “This cupcake is crazy delicious,” this is how we would represent it: the word “this” is represented by its position in the vocabulary, here the second position, and “cupcake” is represented by its own position.

And so it looks like this. And if we had the sentence “The weather is nice today” and we added one to that data, it would shift all of those word indices by one, and the output would be “, property, his winner’s child”. So just adding one to discrete data drastically changes everything, while adding one to continuous data in this range, from zero to 255, hardly changes the structure of the data.
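The contrast is easy to see in code. This toy vocabulary is an illustration (the talk’s shifted output came from their real vocabulary):

```python
import numpy as np

# Continuous pixels: adding 1 barely changes the image.
cat = np.random.randint(0, 256, size=(32, 32, 3))   # stand-in for the cat image
brighter = np.clip(cat + 1, 0, 255)                 # visually indistinguishable

# Discrete tokens: adding 1 replaces every word with a different one.
vocab = ["the", "this", "cupcake", "is", "crazy", "delicious", "weather"]
sentence = [1, 2, 3, 4, 5]                  # "this cupcake is crazy delicious"
shifted = [i + 1 for i in sentence]         # [2, 3, 4, 5, 6]
print(" ".join(vocab[i] for i in shifted))  # "cupcake is crazy delicious weather"
```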

And so we have to be very careful here, because these tools and models for generating images all deal with generating continuous data. Trying to take those models and apply them to text is a big gap to cross.

But a lot of strides have been made in generating text. What happens is that these text models, these language models, learn from millions and sometimes billions of examples just to know how to structure language, and when they’re called upon to generate text, they need a prompt.

So the prompt here is “This cupcake is”, and we give that to a model that’s learned from piles and piles of text data, and we ask it to generate something that completes that thought. This text was actually generated by one of the Jaxon models, and it produced: “This cupcake is an expensive, ingenious way to forget American life everyday. The campaign usually concerns or controversial—”

So you can see the fluidity is there. Maybe it’s not totally accurate, especially that second sentence. But you can also get a flavor of what kind of data this was trained on. This was actually trained on general Wikipedia data, so it doesn’t know much about cupcakes, but it fit something in when it tried to complete the prompt we gave it, “This cupcake is”. So we can generate text, but having it make sense throughout the entire generated text is hard, and making that text long (this here is just about 20 words) is even harder.
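Jaxon’s in-house language model isn’t public, but GPT-2, which was also trained largely on general web text, shows the same prompt-completion idea:

```python
from transformers import pipeline

# Stand-in for the Jaxon model: prompt a small pretrained language model.
generate = pipeline("text-generation", model="gpt2")
for out in generate("This cupcake is", max_length=30,
                    do_sample=True, num_return_sequences=3):
    print(out["generated_text"])  # a different, unguided completion each time
```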

Again, this is not guided. We’re not telling it to generate anything pertaining to a particular class. We’re not saying, “Generate a positive-sounding statement starting with ‘This cupcake is’”; it’s just generating what it feels like, and it’ll generate something different every time. So when it comes to text augmentation right now, to be honest, it’s a lot of hacks. We’re sticking mostly with writing functions to make a new example.

And we can change a sentence. So if we have this example sentence, “This was a great movie”, we can change it on a word level and also on a sentence level. On a word level, we’re just replacing certain words in the sentence, and there are several techniques for this. One is just random word replacement: we randomly take a few words and replace them with other random words from our vocabulary. This first one turned “This was a great movie” into “Potato was a great tuberculosis”, which is also a shout-out to an episode of Seinfeld.

But with that one, you can drastically change the meaning of the sentence. Another way of doing this is TF-IDF word replacement. It’s like random word replacement, but taken one step further: we only replace the most frequent words in our corpus, things like “this”, “was”, “a” (“movie” might not be as common), with other very frequent words. That preserves a little more of the context, even though this one now reads “Of was in great movie”. Another is just to replace some nouns and adjectives with similar nouns and adjectives: “This was a good film”.
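A minimal sketch of random word replacement, the simplest of these; VOCAB is a toy stand-in for a real corpus vocabulary:

```python
import random

# Hypothetical vocabulary to draw replacement words from.
VOCAB = ["potato", "tuberculosis", "weather", "banana", "film"]

def random_word_replacement(sentence, n=2):
    """Swap n randomly chosen positions for random vocabulary words."""
    words = sentence.split()
    for i in random.sample(range(len(words)), k=min(n, len(words))):
        words[i] = random.choice(VOCAB)
    return " ".join(words)

print(random_word_replacement("This was a great movie"))
# e.g. "potato was a great tuberculosis"
```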

But we can also do it on a sentence level, and these are usually more computationally expensive methods. For example, back translation: you take the sentence “This was a great movie”, translate it into German, and then take the German translation and translate it back into English. And depending on the language you use, it can really drastically change the word order.

Going from one language to a second language and then translating back to the first usually produces a different sentence. And you can actually make it give a little more diversity, usually at the expense of fluidity. It could change “This was a great movie” into “The movie was excellent”, and now we have a totally different sentence structure.
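A sketch of back translation with the open MarianMT checkpoints; these two models are one concrete choice, and any translation service works the same way:

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

en_de_tok, en_de = load("Helsinki-NLP/opus-mt-en-de")  # English -> German
de_en_tok, de_en = load("Helsinki-NLP/opus-mt-de-en")  # German -> English

def translate(text, tok, model):
    batch = tok([text], return_tensors="pt")
    return tok.decode(model.generate(**batch)[0], skip_special_tokens=True)

german = translate("This was a great movie", en_de_tok, en_de)
back = translate(german, de_en_tok, de_en)  # e.g. "The movie was excellent"
```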

Uh, there are also ways of combining sentences: you can turn an entire sentence into one vector of numbers, take another sentence that’s very similar and turn it into another vector of numbers, and then just add ’em together. Or take the average, or weight one 70% and the other 30% and then add ’em together.

And that should create another sentence representation that is a combination of the other two.
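The mixing itself is simple arithmetic. Here `embed` is a hypothetical sentence encoder, stubbed with deterministic random vectors just so the code runs; in practice you would use averaged word vectors or a sentence encoder:

```python
import numpy as np

def embed(sentence, dim=300):
    """Stub sentence encoder: a fixed random vector per sentence (illustration only)."""
    rng = np.random.default_rng(abs(hash(sentence)) % 2**32)
    return rng.standard_normal(dim)

v1 = embed("This was a great movie")
v2 = embed("I really enjoyed this film")

average = (v1 + v2) / 2      # plain average of the two sentence vectors
mixed = 0.7 * v1 + 0.3 * v2  # or weight one 70% and the other 30%
```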

Scott

Empirically, have you found one language better to use than another for back translation?

Brad

Uh, no, I haven’t. In some of the papers, I think they use French.

Scott

Okay. Just curious if there’s one that changes it up too much or not enough, and there’s a known—

Greg

It would be interesting to see if you could do it with two languages, do it with French and German, and get two different augmentations out of it.

Brad

Yeah. Or even, like, French and Korean.

Yeah. And so there are all these different ways, but it’s still a very hard task to have a model look at a corpus of text, learn the language, and also learn the classes those examples belong to. There are some workarounds: instead of using discrete values as your inputs for words, you can use word embeddings, but even word embeddings are just an approximation to the true representation of a word.

Because something like word2vec is also trained on some other corpus (the standard pre-trained vectors come from Google News), and the context of that corpus may not have anything to do with the data you have on hand. So there’s still a gap to bridge. Instead of using discrete representations, you can use word2vec, which is a continuous representation, but it’s still not an exact representation of a word.

Scott

Interesting. I didn’t know that. I didn’t know word2vec was trained on the Google News vocab.

Brad

Yeah. And you can train your own word2vec on your own data.
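Training word2vec on your own data, so the embeddings reflect your domain rather than Google News, is a few lines with gensim; the corpus below is a toy stand-in for your tokenized text:

```python
from gensim.models import Word2Vec

# Hypothetical mini-corpus: each document is a list of tokens.
corpus = [
    ["this", "cupcake", "is", "crazy", "delicious"],
    ["the", "weather", "is", "nice", "today"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
vector = model.wv["cupcake"]  # the learned 100-dimensional embedding
```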

Greg

Yeah, we do. Every topic model we create in the classical tab is a custom-trained word2vec.

Brad

That’s right. That’s right. Um, and then–

Scott

fastText? Do we do custom training for fastText as well?

Brad

Yeah. Is that for me or Greg?

Scott

Greg.

Greg

Yeah, same deal.

Scott

Okay.

Greg

But since we’re on the topic, we can also import pre-trained word2vec embeddings into our neural models, which was something we added to support Korean.

Brad

Ah, that’s right. It’s hard to find a fully pre-trained Korean model. Might as well have part of one.

Scott

Very good. All right, thanks. Carry on.

Brad

And then finally, in Jaxon, one way that we use data augmentation is in conjunction with supervised learning. We use it when we have a small amount of labeled data and just piles of unlabeled data; that’s a great scenario for data augmentation.

How Jaxon uses it: the purple box is just the normal classification process, where you have labeled data, it goes into a model, the model makes a prediction, and then it learns based on how right or wrong that prediction was.

But what we add is an auxiliary step. While the supervised learning in purple is going on, in the blue box we’re trying to utilize the unlabeled data. How we do that is we take an unlabeled example and create an augmented copy of it using one of the methods we just described on the previous slide.

For example, with random word replacement, we replace a few of the words in the original unlabeled example and create its pair, augmented with a few switches. Both of those go into the model, and the model makes a prediction on both examples.

And the point of this is not to get the prediction right, because it’s unlabeled; even the augmented copy is unlabeled, so we don’t know the true label. But we want the model to be consistent in its predictions: whatever it predicts for the unlabeled example, we want it to predict the same thing for its augmented counterpart. That’s called consistency learning.

And so we have these two things going on at the same time while the model is training. It turns out this is a very powerful technique, because even though you have very few labeled examples, the model is learning how to classify those correctly.

And even slight variations of what you have are going to be classified the same way. Sometimes we make drastic variations: we create a heavily augmented example from a piece of unlabeled data and say, you need to predict these two things the same. In doing so, one example can have a lot of variation and still be classified into the same category.

And so it can take a small amount of labeled data and really, really generalize well from it.
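A sketch of one training step combining the supervised ("purple box") and consistency ("blue box") objectives; `model`, `augment`, and the weight `lam` are stand-ins, not Jaxon’s actual implementation:

```python
import torch
import torch.nn.functional as F

def training_step(model, labeled_x, labels, unlabeled_x, augment, lam=1.0):
    # Supervised branch: ordinary cross-entropy on the small labeled set.
    sup_loss = F.cross_entropy(model(labeled_x), labels)

    # Consistency branch: the prediction on an unlabeled example is the target
    # (no gradient), and the prediction on its augmented copy must match it.
    with torch.no_grad():
        target = F.softmax(model(unlabeled_x), dim=-1)
    pred = F.log_softmax(model(augment(unlabeled_x)), dim=-1)
    cons_loss = F.kl_div(pred, target, reduction="batchmean")

    return sup_loss + lam * cons_loss
```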

Scott

Very good. Well, I didn’t wanna interrupt you there, but I know we’re past the half-hour mark. Um, we do have a few extra minutes if anyone has any questions. If not, then please feel free to reach out to us directly; you can email jaxon@jaxon.ai. I hope you enjoyed this webinar and learned something from it.

Brad, that was awesome. I really appreciate you sharing all that wisdom. Um, and for clarity, we have this baked into the Jaxon platform, and if you’re interested in seeing a demo, then let us know. All right. Thanks everyone. Take care.

[Diagram: Synthetic data creation with Jaxon. Freeform text and tabular data are processed with different synthesis methods; the results are filtered; the resulting synthetic data is ready to use in your models.]