Rapid Prototyping Webinar

Does there have to be a tradeoff between speed and accuracy? Using rapid prototyping, you can have both.

In this talk, the Jaxon team discusses how to minimize your model’s time-to-production while maximizing its usefulness and accuracy.
 

They discuss how to solve real-world business problems, starting with the concept of the “problem spec” and how it can be used to frame problems in terms that ML models are equipped to solve. Next, they dive into model creation and iteration, plus some common stumbling blocks and techniques to improve models. Finally, they discuss learning rates and multi-stage model training.


Scott

Hello, and welcome to the Rapid Prototyping Webinar. I’m joined today by Greg Harmon, co-founder and CTO of Jaxon, and Robin Marion, also a co-founder and Jaxon’s Chief Product Officer. Welcome, and thank you for joining me.

Robin

Thank you for having us, Scott.

Greg

Good morning, Scott.

Scott

Good morning. All right. Well, it’s actually good noon. So I figured I’d just start by giving a quick overview of who Jaxon is, why we exist, and then we’ll get into the meat of rapid prototyping and how Jaxon can be used to quickly find winning models.

All right, so at a high level: Jaxon takes raw data, representative of a specific use case that our customers want to build custom machine learning models around, and labels that data. It labels it with AI itself as well as a small seed of human input, so, as little as possible, or, as I like to say, “just enough” human time to get highly accurate models out the other side.

We’ve taken things ranging from the pseudo-labeling that drives a lot of the synthetic labeling; speaking of synthetic, creating synthetic data itself to augment an actual dataset; gold loss correction; model calibration; generative modeling. There’s a hodgepodge, and we’re not gonna get too deep into all of them today, but we’ll probably touch on many of them as the conversation progresses.
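
To give a flavor of the first of those, here is a minimal pseudo-labeling sketch, assuming scikit-learn and toy data rather than Jaxon’s actual pipeline: train on a small human-labeled seed, then promote only the model’s confident predictions on unlabeled data into training labels.

```python
# A minimal pseudo-labeling sketch: a small labeled seed plus confident
# model predictions on unlabeled data. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
seed, rest = slice(0, 50), slice(50, None)   # 50 human labels, 950 "unlabeled"

model = LogisticRegression(max_iter=1000).fit(X[seed], y[seed])
proba = model.predict_proba(X[rest])

confident = proba.max(axis=1) > 0.9          # keep only confident guesses
pseudo_X = X[rest][confident]
pseudo_y = proba[confident].argmax(axis=1)   # predicted class becomes the label

# Retrain on the human seed plus the synthetic (pseudo) labels.
model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X[seed], pseudo_X]),
    np.concatenate([y[seed], pseudo_y]),
)
```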

High level, where we fit into the world is data comes through us. We create the fully trained models, and then they go off and do their jobs in runtime.

So once it hits production, it’s out of Jaxon’s hands, but there’s an API call back to Jaxon to continuously train these models, to continuously keep them up to date with the latest data. If it’s natural language, the latest way that people are talking; if it’s more structured data, the latest historical data that can drive the learning of the prediction models.

All right, I’m gonna get off my soapbox. In fact, I’m gonna get out of this slide deck and turn this into more of a conversation. So to do that, I’m gonna go over here actually to the agenda itself.

So, we have set the agenda to be a high-level discussion around how you start out this model’s journey: what’s the solution breakdown? Then we’ll get into things like neural architecture design, and then ultimately harp on the fact that it’s all about data now; the models that are freely available are really pretty good, and it’s more about getting the right data to drive the training of those models, or what the latest buzz calls data-centric AI.

All right.

Robin

Right. So if I can join in there. Scott, you touched on one of the biggest problems that businesses see in this industry: identifying the business problem is critical, and that’s one of the initial discussions where the business and the engineering teams, the machine learning teams, have to really get to the core of what the business actually needs.

And so defining the problem spec means identifying what the trained machine learning models are expected to return in order to have a business impact. That takes up a bulk of the practice when it comes to machine learning. A lot of times, when you talk to business leaders, you hear about these machine learning projects taking a long time, four months, six months, and still the results are not found.

And there are a lot of reasons why something like that happens, and one of the main ones is not really being able to identify what the problem spec is, or whether the model is really giving the results that are required. And so Jaxon helps businesses and machine learning teams succeed faster through the concept of the rapid prototype.

Greg

Yeah, I can pick up on one thing there. Uh, folks listening may not be familiar with the notion of the problem spec, which is something that we use routinely. Um, it’s an idea that we took originally from some of Don Neuf’s [unclear] old writing. But the notion, as applied to machine learning, is that we spend so much time focusing on what models we use or what data’s going into the model and, I think, not nearly enough time focusing on what the model’s trying to predict on the inference side of things. You know, typically folks will just get some dataset, maybe it’s pre-labeled, but that one label set defines the reality of what you’re predicting.

But “I want to classify this versus that” doesn’t always line up exactly with the business problem I’m actually trying to solve. And there are often different ways to frame a particular problem or, getting more concrete, to craft the problem spec, that will address the same general business need. Sometimes figuring that out is certainly more important than model architecture selection, maybe even more important than the specifics of the data. Data-centric AI is important, but I think problem-centric AI, you heard it here first, folks, might be even more important than being data-centric.

Scott

So let’s pick up on one of the toy datasets, or toy projects, that we’ve been working on for years now around recipe classification: taking different recipes and classifying them by what cuisine type they fall into. So we have things like American, French, Italian, Greek, Chinese, Japanese. Uh, there’s a good dozen or so, and there’s a lot of confusion around the Mediterranean area, where we have a class called Mediterranean, but we also have a class called Italian. How do you disambiguate between the two of those?

So immediately the problem spec comes to mind. You need to fine-tune that to get the model to where you want it to be, because it’s constantly gonna have problems there. Or you remove the class that is causing the ambiguity, like Mediterranean, and you have individual representations that would cover the gamut. Uh, one thing that strikes me is that you’re then left with minority classes that aren’t well represented in the original corpus of text. So all of a sudden you can’t represent a particular class because it’s not in the data.

Greg

Well, I’ll take you one further. I’ll raise you one there.

Tomorrow, a brand new nation appears with its own brand new cuisine, which nobody’s ever seen before. How do you accommodate that as well? That’s the equivalent of having a brand new class that was completely unrepresented in the training data.

Robin

Right, right. Yeah. So, you know, we are talking about examples. There’s one other example that I just remembered from my past life, where we were trying to identify fraudulent logins to a cloud entity from multiple different channels. It could be the web, it could be the app, it could be a device. At the company I’m talking about, where I spent a lot of time, we had a lot of devices, and people could log into their account through multiple different devices, and every device had a different type of rules.

And so one of the challenges that we had, I remember when I was driving this conversation with our machine learning team, was that they were not able to identify the rules associated with each channel. And so the first round of data collection and the model training that was done resulted in a lot of false positives.

As an example, if people try to log into an account on the web, there are multiple reasons why a login might fail; maybe they don’t have the password. But if you’re trying to do it from a device, let’s say an iPhone, the password is saved, and the face recognition auto-populates it. In that case, the weight given to a failure should be different.

And so measuring where the failure happened, how often it happened, and specifically whether it was the web channel or the device channel, that data matters. So, talking about problem definition, as you said, Scott and Greg, I think problem-centric machine learning, modeling, or identification of the right model, goes hand in hand with data-centric identification of the problem, because that really solidifies that we are solving the right problem.

Greg

Right. And I think you’re hitting on one of the important aspects of our notion of the problem spec as well. Yes, it’s about what you’re trying to predict, but the other half of a problem spec is “and how are you measuring that success?” Take an example like this cuisine classifier we’ve mentioned.

Do you actually care equally about all the classes? Are certain mistakes more expensive than others? Do you want to use accuracy, or an F-score with a beta chosen for what you’re trying to do? What’s the right metric for evaluating how well you’re matching up to the core task, and the class list in that case?
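
As a concrete illustration of how the metric choice changes the verdict, here is a small sketch assuming scikit-learn and made-up cuisine labels (not the actual recipe dataset):

```python
# Hypothetical predictions for a cuisine classifier; illustrative only.
from sklearn.metrics import accuracy_score, fbeta_score

y_true = ["italian", "greek", "italian", "french", "italian", "greek"]
y_pred = ["italian", "italian", "italian", "french", "greek", "greek"]

# Plain accuracy treats every mistake the same.
print("accuracy:", accuracy_score(y_true, y_pred))

# F-beta lets you weight recall vs. precision: beta > 1 favors recall
# (missing a class is costly), beta < 1 favors precision.
print("F2 (recall-heavy):", fbeta_score(y_true, y_pred, beta=2.0, average="macro"))
print("F0.5 (precision-heavy):", fbeta_score(y_true, y_pred, beta=0.5, average="macro"))
```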

Scott

All right, perfect segue to Jaxon 1.5. So this is the first time the world is seeing this. This is our latest release, and we are very excited about it. It is now multimodal, meaning we can handle both natural language and structured data, aka tabular data. And I figured I’d take you through one of the projects that I’m currently working on.

I started it yesterday, and I’m in the middle of trying to make predictions on whether or not someone’s going to click on an ad. I can show you the columns, or the features, as they’re called, that I am playing with: things like age, area income, how long they’re on the internet per day, and, importantly, the ad topic line, which is the natural language piece.

So I am in our neural network creation tab, and I am tinkering. I am trying out different combinations. So I started out yesterday, like I usually do, with a quick brew. A brew is not as heavy as these other models that we support. You get a quick read on what your data is and how it’s going to relate to making predictions.

And I have a 49. For clarity, for those that don’t know: when we’re training these models, we carve off a test set from the training set, and that’s what this F-score reflects, how accurately the model that was trained on the training dataset did against that holdout test set. So on my first pass I got a 49, and I was like, oh boy.
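
The carve-off itself is standard practice; a minimal sketch, assuming scikit-learn and synthetic stand-in data rather than the ad-click project:

```python
# Holdout evaluation: train on one slice, score F1 only on the slice
# the model never saw. Data here is synthetic, for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout F1:", f1_score(y_test, model.predict(X_test)))
```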

All right, so let me go up to RoBERTa; maybe a beefier model will do better. And it got worse. Not only did I increase the text representation model, I increased the feed-forward network from a small to a medium, and it still went down.

Greg

Just to pause you there, to catch people up. Part of the way that Jaxon handles these different modes is that we’ll combine different representations. So, yeah, clicking there might be good to show people: when you start working through a new model, one of the things you need to do to design it is pick a different representation to handle the tabular data versus, say, the text data, which are the two types that we have in here.

For the text data, you want to take advantage of what works well for natural language: transformers, primarily, in this day and age. But transformers actually don’t work all that well for a lot of tabular data, and feed-forward networks tend to at least equal the transformers with a lot less weight, if not outperform them at times.

So we’re actually picking different representations to handle the different modalities, and then, underneath the hood, we’ll do some things to help combine them. Certainly a naive way to go about it is to simply train these things in parallel and merge ’em together right before you go through a task layer, which, by the way, would be customized by the problem specification we’ve set.

But another thing we do, in particular, is that we implemented our own version of some research that Facebook had been exploring last year around why multimodal data is hard. Effectively, with one joint representation, your feed-forward network might learn much, much quicker than your transformer, and then you end up with one of your representations overfitting while the other is still underfitting when everything bottoms out, which isn’t what you want. So you can strategically pace the multiple representations to train at the same rate as each other, so everything achieves its optimal peak at the same time.
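
In rough outline, and only as a sketch under our own assumptions (a PyTorch toy, not Jaxon’s implementation or Facebook’s exact gradient-blending recipe), the idea is separate encoders per modality, fused before a shared task layer, with per-modality loss weights that can be re-balanced as training progresses:

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    """Hypothetical two-tower model: one encoder per modality,
    fused just before the task layer. Dimensions are illustrative."""
    def __init__(self, text_dim: int, tab_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.tab_enc = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())
        self.task = nn.Linear(2 * hidden, n_classes)   # fused task layer
        self.text_head = nn.Linear(hidden, n_classes)  # per-modality heads,
        self.tab_head = nn.Linear(hidden, n_classes)   # used only for blending

    def forward(self, text_x, tab_x):
        t, s = self.text_enc(text_x), self.tab_enc(tab_x)
        fused = self.task(torch.cat([t, s], dim=-1))
        return fused, self.text_head(t), self.tab_head(s)

def blended_loss(fused, text_logits, tab_logits, y, w=(1.0, 0.5, 0.5)):
    """Overall loss plus weighted per-modality losses; the weights w can
    be re-estimated periodically so neither tower over- or underfits."""
    ce = nn.functional.cross_entropy
    return w[0] * ce(fused, y) + w[1] * ce(text_logits, y) + w[2] * ce(tab_logits, y)
```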

Scott

So I’m gonna get right back to that. Hold that thought. Um, I just want to go to the heart of where I was struggling. I was in the forties, and I tried out heavier weight models. I tried out gradient blending, which is what Greg was just alluding to, and I still wasn’t getting where I wanted to be.

So I came back to the features. I had selected all of the features, so all of these columns were used in the model’s training, and I started playing around with removing some. It turned out these three were negatively impacting the model’s training, and once I removed them, all of a sudden I jumped up 30, 40 points.
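
That kind of trial is easy to script; a quick-and-dirty leave-one-out ablation sketch, assuming scikit-learn and synthetic columns rather than the real ad features:

```python
# Drop one column at a time; a score that goes UP when a column is
# removed flags a feature that was hurting training.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
cols = list(range(X.shape[1]))

base = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=3).mean()
for c in cols:
    keep = [i for i in cols if i != c]
    score = cross_val_score(
        RandomForestClassifier(random_state=0), X[:, keep], y, cv=3
    ).mean()
    print(f"drop col {c}: {score:.3f} (baseline {base:.3f})")
```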

I like to think that I doubled my accuracy in a matter of minutes, but, marketing aside, let’s talk about feature engineering for a second. Greg or Robin, any thoughts on this subject?

Greg

Well, I think you just gave a pretty compelling example of why it matters. I know that the promise of deep learning is effectively that it’s automatic feature engineering; that’s the entire purpose of deep learning. The actual classification bit really is just a pretty simple linear layer; it could just as easily be your favorite tree-based system. It’s those representations, which are the feature extraction, that are the promise of deep learning.

And yet trusting the deep learning to do it, with the current models that are reasonable to run on a single-GPU system, got you to the mid-forties. Doing just a little bit of injection of human common sense and looking at the data…

Scott

Mm-hmm.

Greg

It just bought you another 50 points.

Robin

Nine. So let me ask you this question, Greg. Uh, what about positive and negative signals that might be present within the data? I think what Scott was experiencing, in a way, was that for a dataset of a given size, if some significant data points in there aren’t giving accurate enough signals to the neural net, it can’t form the correct, I would say, impression of what the output should be for a successful result.

Um, what are your thoughts around that? Because that’s something I see across the board many a time, and I think it’s one of the main issues when it comes to machine learning projects: it’s not easy to identify whether the data has the right signals.

And that’s exactly feature engineering, right? I mean, you’re trying to identify the best features that give you the best signals to identify and predict what the successful and correct outcome’s gonna be.

Greg

Right, right. And that’s exactly why you wanna be able to iterate quickly on them. You know.

Robin

Okay.

Greg

The limits of automation are really what we see when we let the deep learning itself do it. If it’s not doing the job and you need to get beyond the deep learning, that’s where a little human input comes in, and feature selection is one of the easiest and most powerful ways to do it, if you have things set up so that you can iterate really quickly.

While we were having this discussion, I saw that Scott was clicking around in there, and an interesting loss curve popped up. If you wouldn’t mind going there, I think folks who are familiar with what loss curves normally look like might be interested in hearing an explanation for, trying to stare at my screen, right around epoch 20. There’s this giant jog in this particular loss curve, which is atypical; normally you expect your losses to gradually go down asymptotically and your accuracy, which is the black line, to go up.

What’s happening here is the gradient blending that we mentioned before. Part of what gradient blending does is, every so often, it will send a shock to the system and re-weight how the system is treating the overall loss as well as the modality-specific losses, which helps rebalance so that we don’t max out on one modality before the other has finished learning.

Here, you can see that the blue and the purple and pink, which represent different modalities, started to diverge. The orange line, which is our text training, was flatlining, so our tabular data was learning much faster than our text data. So we send a shock to the system to try to balance those out. Um, which looks a little funky in an absolute loss chart if you’re used to how they typically look when you have a smoother, more constant loss function throughout training.

Scott

All right. We have a few questions, but before we get to those, I wanted to run through one training schedule creation. So I’ve tried Longformer, I’ve tried DistilBERT, and I’ve tried this LSTM, and the LSTM so far is the best model, which is kind of surprising. It speaks to the “no one size fits all” idea, or the no-free-lunch theorem: you’d think these big transformer models are always gonna prevail, but sometimes other architectures do better with particular data and specific problem specs.

Greg

Yeah, we could speculate about why in this case, but, you know, my suspicion is we’re getting a lot of signal, probably more signal, out of the feed-forward network, just based on these loss charts. Here’s another one where I think gradient blending was not applied, where we can see that the tabular loss dropped very quickly, while the text loss has a very slow, slow downward slope.

But if we actually look in this particular dataset at what’s in the natural language, it’s not long, complex prose; it’s very short snippets. So something like the LSTM, which is maybe chopping it up with a very short attention span of a few words rather than trying to put together full prose…

It just seems to fit that data a little bit better than one of the bigger transformer-based models would. It has as much to do with what those models happen to be pre-trained on as with the model architecture itself. They’re similar, but BERT, for example, or RoBERTa, were trained on different data than the LSTM was.

Scott

Is my hover coming through? Just wanna—

Robin

Yes, it is.

Scott

—Prove you right, that gradient blending was not used in this particular training pass.

Robin

Right.

Scott

Which I find interesting, ‘cause sometimes gradient blending was giving lift to the model.

Robin

Mm-hmm.

Scott

Other times, it wasn’t. And that’s kind of the point of rapid prototyping, is that you try out different techniques, and you see which one’s gonna stick.

Greg

Right, and you also think about your trade-off system-wide. So regardless of how you bring that text analysis into play, one of the things that we could see from looking at a few of these is that maybe there’s a little signal to be found in the text, but most of the signal in this particular data set is in the numerical columns, in the tabular data.

There’s a very weak signal at best, meaning that if you’re trying to engineer a system that would be fast and cheap to run in production, it may well be that you should just give up the text feature so you don’t have to carry the very heavy transformer model along with it. You can get very nearly as good a predictor here with just a small feed-forward network, which you can run lickety-split, and probably don’t even need a GPU for, to do fast inference in production.

Robin

Right, okay.

Scott

All right, so I’ve got—sorry, Robin, did you wanna say something?

Robin

[loud noise warning]
No, actually, I think I totally agree with you, Greg, because I was working on a different dataset, basically about 40,000 tweets, trying to classify them based on emotion, with about 12 different classes.

And I experienced exactly what you just said: by iterating on the training model, I was able to get step-by-step increments in accuracy. And as I went through testing different algorithms, which is conveniently very fast on the Jaxon platform, we were able to see a really good score, in the lower eighties, for that wide variety of tweets. It should still be there on our demo server; I just wanted to put it out there. Scott?

Scott

Yeah, absolutely. It’s been a common theme over the past few years: a lot of datasets have a number of flaws, and if you’re training your models against a flawed starting point, you’re not gonna be happy with what comes out the other side. Yeah.

Robin

And this might be a little bit vague, but I just wanted to put it out there: once we have the right model, with high accuracy, 90 to 95, or, depending on the data, in the eighties, you’ll be able to use the Jaxon platform not only to continue rapid prototyping, but also, let’s say you want to use that model to auto-label, you can do that within the platform.

Scott

Alright.

Greg

We had a question a little while back from Philip… just so we don’t miss it.

Scott

Okay. Go ahead.

Greg

Um, “Is there a way to easily identify which data points are skewing results, or is it trial and error?” Unfortunately, it is a lot of trial and error. There are explainability techniques that one can sometimes apply; I think of them more as attribution. Um, the trouble is it depends on your architecture, and it can be kind of heavy to get into.

Things like, say, Shapley values can definitely quantify how much any given feature contributed, but it gets heavy in a hurry to try to do some of that stuff, so you have to think about whether that’s actually better than some quick-and-dirty trial and error. Part of what we’re trying to do is say, “Hey, can we keep it simple, stupid?” Go for the easy, low-hanging fruit. Let people just try things, because people are probably gonna have an intuition at first about which things are likely to be hard and easy, and save the heavy machine-explanation stuff for the few cases that really need it.
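
The two ends of that spectrum look roughly like this in a sketch, assuming the `shap` library and scikit-learn on toy data (not anything Jaxon ships):

```python
# Heavy but principled attribution (Shapley values) vs. a quick
# permutation check. Model and data are illustrative only.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Heavy: per-feature, per-example Shapley attributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Quick and dirty: how much does shuffling each feature hurt the score?
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(perm.importances_mean)
```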

Scott

All right. Just to get through things, I wanna stop sharing my screen soon and finish out the webinar with the final questions. But first, I wanna train a new model here. Greg, do you want to direct me as to what to pick?

Greg

Sure. Um, now, you’ve had some experience with this, so if I were coming at it with that experience, I’d already have a feeling for where things are. But if I’m coming in brand new, I’m probably going to, you know, go right down the middle: a medium feed-forward network is always a reasonable default choice.

Likewise, the text representation gets to be a bit of a personal preference. By the way, we can install many other models, but this is a typical default set for us.

I tend to like RoBERTa as a pretty good go-to at first. It’s not the lightest-weight model here, and it’s not the heaviest, but for most cases it’s a pretty good, happy medium between the two.

Scott

All right. How about for learning rate? I know it comes preset, basically.

Greg

Yeah, I would leave that alone. The first time through, I would go as dumb and default as I possibly can and see my results. Then, depending on what I see, in terms of not just training accuracy and things like our confusion matrices for a single-label classifier, but also the loss charts, I would consult, for example, our cookbook, which has some common patterns in terms of what you see, and decide from there where to go next if I need to improve things.

Scott

Worth noting that these are templated. So if I change from RoBERTa to Longformer, note that some of the parameters have changed; or if I go to a GRU…

So, and this kind of answers one of the other questions I see in here, “Are you an AutoML?”: we have things that automate the process, yet more advanced users still have control and can make changes. All right, so leaving these alone, how about these options down here?

Greg

Well, there are lots of ways. We probably don’t really have time to go deeply into any of them, ’cause I see we’re really hitting up against the half hour here. But we have lots of different techniques that can allow you to condition how your training will be administered in different scenarios.

Um, we’ve talked about gradient blending, for example, quite a bit already today, but there are plenty of other features here to get around either typical training challenges we see or limitations of some of these models.

Long-form document handling, for example, helps us take transformers, which tend to be heavily memory-limited to certain token-sequence lengths, and dynamically adjust the models to handle longer strings; we start introducing attention layers and that sort of thing to enhance the transformer representations.
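
One common way to get past a fixed token limit, shown here purely as a generic sliding-window sketch and not as Jaxon’s specific mechanism, is to split the sequence into overlapping chunks and pool or attend over the per-chunk outputs:

```python
# Split a long token sequence into overlapping windows; each window is
# encoded separately, then per-window outputs are pooled (mean/max) or
# combined through an extra attention layer.
from typing import List

def sliding_windows(tokens: List[int], max_len: int = 512,
                    stride: int = 256) -> List[List[int]]:
    windows = []
    for start in range(0, max(1, len(tokens) - max_len + stride), stride):
        windows.append(tokens[start:start + max_len])
    return windows

chunks = sliding_windows(list(range(1300)), max_len=512, stride=256)
print([len(c) for c in chunks])  # [512, 512, 512, 512, 276]
```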

The other thing I think that’s probably worth pointing out here before we set this one down for the day would be the notion of multiple training stages, potentially with multiple task specs.

That is, we don’t have to do just one pass of training. We can set up a multi-stage run; pre-training is a typical example.

We could do some language modeling to maybe improve, say, in this case, the Longformer. Here it wouldn’t be a long-tail foreign language, though you could use it for that purpose; in this case, we want it to learn the language of ad titles, and we can get it to be more sensitive to that. And as we know from watching some of the big language models that are driving a lot of the hype and excitement today, those are heavy processes to train.

So it can sometimes be useful to train them once and then have your own customized version of BERT or Longformer or whatever you like in your company. It could become my Jaxon adBert, and then I can use that as a base to do a lot of rapid experiments, having already customized that model a little bit more to fit my own data and my own organization.
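
The two-stage idea, continue masked-language-model pre-training on domain text, save it, then fine-tune that base, looks roughly like this with Hugging Face transformers; the file name and output path are hypothetical, and this is a generic sketch rather than Jaxon’s training stages:

```python
# Stage 1: continue MLM pre-training on domain text (e.g. ad titles),
# producing a custom base model to fine-tune later for classification.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# "ad_titles.txt" is a hypothetical file of one ad title per line.
ads = load_dataset("text", data_files={"train": "ad_titles.txt"})["train"]
ads = ads.map(lambda x: tok(x["text"], truncation=True), batched=True)

trainer = Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="my-ad-bert", num_train_epochs=1),
    train_dataset=ads,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("my-ad-bert")  # stage 2: fine-tune this base for the task
```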

Scott

All right, so while Greg was talking, I added in a few more stages. I chose different models and different combinations, and Jaxon will figure out where to stop, where the most accurate model lies. So I’m gonna hit standard priority here, and it’s off and running, and we’ll find out how we did in probably—

Robin

—The next webinar.

Scott

Yeah, the next webinar.

We do have a GPU tied to Jaxon, and it’s gonna go off and process, but even still, it takes a good bit of time to actually train. Usually it’s within 30 minutes or so. Uh, the longer ones can take an hour or so. So, alright.

Um, we do have a few more questions in here. I kind of alluded to it, but let me just ask it, and I’ll let either one of you take the answer.

Do you consider yourselves to be an AutoML company?

Robin

Eh…

Greg

Yes and no.

Robin

Yes and no, exactly. It all depends on the user, right? We cater to users who want the flexibility to manipulate a lot of the bells and whistles when it comes to modeling the data, but who at the same time don’t want to do the whole setup.

Like, let’s quickly click on things, get the data inside, imported, and trained as soon as possible to see the results, which is basically our idea of rapid prototyping, while at the same time giving them the flexibility to manage and modify things, just as Scott showed by adding more training stages.

But at the same time, we also have an option where you can choose the autopilot mode, which does a lot of these things for you automatically. Greg can add a little bit more flavor on the autopilot option that we have in Jaxon.

Greg

Yeah, I’ll just put it succinctly.

AutoML is a tool in the toolbox, but it’s not the toolbox itself.

Scott

So one thing I wanted to make sure is clear is that AutoML companies expect you to show up with training data; they expect that data to already be labeled, ready to run a model against. Jaxon’s one of a few, I don’t know of many, actually, um, if any, that have both the labeling capability and the model training in one platform. All right, next question. Well, we kind of just answered this, but: “Looks like you need to be a data scientist to use your tool. Is that true?”

I’ll take this one to start, and then I’ll get each of your takes on this, and then we’ll wrap things up. We’ve intentionally built this for those who don’t know how to code. They may even have the title of data scientist, and maybe they actually do know how to code, but it saves them a ton of time to just let Jaxon do it.

To try out all these variations requires custom coding. Usually, data scientists will work in their notebooks. They’ll pull in a bunch of different libraries, and in order to get them to work, they’re writing custom code to bring it all together. Um, so we’ve automated a number of those features. There are parts of the platform that are squarely not even doing data science, like the labeling piece.

So we’ve built Jaxon also to cater to those that just want to have their labelers use Jaxon and have everything in one platform. So once the labeling is done, there’s a handoff to the data science team.

Robin

Right.

Scott

And lastly, I just want to cover that engineers are also welcome. We’ve found that a lot of organizations have data scientists working alongside data engineers, sometimes two to one, and we want to give those data engineers a tool like Jaxon to automate a number of the processes they need to support the data scientists.

Greg

I think it’s worth pointing out that just because you don’t have to code to use Jaxon doesn’t mean you can’t; everything that you saw was API-driven.

So there are both options. But I would just say: you don’t have to be a data scientist to use our tool, but using our tool might make you a data scientist.

Robin

That is a good way to put it. Uh, just to add a few more things to what Scott and Greg said: in my experience, what I have seen is that a lot of people working within an organization have access to the data, right? But they don’t know what to do with that data. So, as an example, product managers across the board use data on a daily basis and have some understanding of how model training works.

Those things are now very commonplace, and product managers have experience working with data scientists. So if they have the data but the data scientist is busy, let’s say with a lot of other projects going on, Jaxon is a tool that a product manager can log into, import the data, and run some basic modeling on it, to see whether what they’re thinking about is even relevant: is the data giving the right signal? Is the model able to train?

And then they can use that as a basis for a conversation with the data science team. And for organizations that don’t have a data science team, Jaxon also has the capability to export that model into production to be used. So, saying that you need to be a data scientist to use Jaxon, as we said, it’s not a requirement. Anyone who wants to work with data, to identify whether it gives the right signals, and to experiment with it, they are all users and can use Jaxon.

But as… I like what Greg ended with. Using Jaxon could make you a data scientist.

Scott

That is good. I like that; I might just use it. All right. Very good. Well, we’re over time, but I’m sure we could keep talking for hours on this subject. This was fun. Thank you, Robin; thank you, Greg; thank you to our audience. We’ll hopefully be turning this into a video that we’ll post on our blog in the not-too-distant future. And if you’re interested in seeing a live demo, please reach out to jaxon@jaxon.ai. Thank you all.

Robin

Thank you.