We had hubris a couple years ago, thinking you can create accurate machine learning (ML) models completely unsupervised. Turns out, some human supervision really is needed. Just enough human knowledge to train the model properly. The trick is to minimize that human supervision and use machine learning itself to alleviate the “grunge work”. There is a lot of data pre-processing, labeling, and model calibration that can automate the bulk of the process.
You want to have some ground truth in the form of pre-labeled examples for each class that you are trying to classify data into. How many you need per class could be as little as one or two like some companies are saying in the category of few shot learning. But back to the point of labels and human supervision, when you go completely unsupervised you are missing out on a lot of knowledge that humans have. It all depends on the use case and the data itself, but we typically see great results with roughly 100 per class. With Jaxon, you can quickly get guidance on how much really is needed to hit your desired performance metric.
A great way to encapsulate human knowledge about a domain is to write heuristics. Marrying these heuristics into an ensemble of classical and deep neural models usually should be done with more sophistication than just a majority vote. With innovations like Snorkel coming out of Stanford, intelligent aggregation and the ability to abstain from even using a heuristic is now possible. Here’s their definition of the new category of ML called Weak Supervision:
“Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting.”
Heuristics are really powerful and a great way to fill in gaps of coverage for a model. We recommend feeding data into Jaxon and seeing how the trained model performs. Maybe it’s good enough without any further human time. Likely, heuristics can give lift to areas where the classical and neural models are performing subpar.
Another lever to pull in Jaxon is synthetically labeled data vetting. Using a small amount of pre-labeled examples, Jaxon bootstraps an ensemble that serves as a labeler. From there, users label their raw (unlabeled) examples. The cycle continues as a new dataset in Jaxon and, through a vetting portal, users review the labels and either agree or disagree with the label. Simple as that. We recommend sampling 500 examples (regardless of corpus size), which typically takes a couple hours to vet.
Using vetted labeled examples (what we call gold) for training, testing, and calibrating is key for hitting high F1 scores with minimal human time. It’s especially handy for noisy data (which, let’s be honest, is the norm, especially with natural language data sources). This gold amplifies human knowledge and drastically reduces the need to manually label tons of examples.
Look at this before and after for a model we recently helped create:
– Scott Cohen, CEO