The Trouble with Active Learning

In the beginning, training data platforms (TDPs) were simple web apps. They allowed a team of human labelers to coordinate their efforts hand-labeling a dataset for training a machine learning model. And they weren’t so great.

Each training example to be labeled was selected at random from the overall pool, and there needed to be a lot of them. Skewed datasets made this worse: how many white swans needed to be labeled to find each black swan? Spoiler: it could be a lot. If only there were a way to make those black swans easier to find.

Enter active learning onto the TDP landscape. Now—and it is now; this is the basis for most current TDPs—an active learning model prioritizes which examples get labeled next (typically the ones it is least certain about), surfacing rarer examples more often than they would naturally appear and reducing the overall amount of data to be labeled before finding a sufficient number of black swans.
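To make that loop concrete, here is a minimal sketch of pool-based uncertainty sampling, assuming scikit-learn-style estimators and numeric feature matrices. The `oracle` callback standing in for the human labeler and the batch size are illustrative placeholders, not any particular platform’s API.

```python
import numpy as np

def least_confident_indices(model, X_pool, batch_size=10):
    """Pick the pool examples the model is least sure about."""
    proba = model.predict_proba(X_pool)          # shape: (n_pool, n_classes)
    confidence = proba.max(axis=1)               # top-class probability per example
    return np.argsort(confidence)[:batch_size]   # least confident first

def active_learning_round(model, X_labeled, y_labeled, X_pool, oracle, batch_size=10):
    """One round: fit, pick uncertain examples, ask the labeler, grow the labeled set.

    `model` can be any estimator with fit/predict_proba; `oracle` is a
    hypothetical stand-in for the human labeling step.
    """
    model.fit(X_labeled, y_labeled)
    picks = least_confident_indices(model, X_pool, batch_size)
    new_labels = oracle(X_pool[picks])                      # human answers here
    X_labeled = np.vstack([X_labeled, X_pool[picks]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, picks, axis=0)               # remove from the pool
    return model, X_labeled, y_labeled, X_pool
```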

So, does this work? Well, maybe. Let’s poke at it a bit: are there enough pre-labeled examples (including those elusive black swans) to bootstrap the active learning model? If not, there’s a cold start problem, and it’s back to random sampling until enough black swans have turned up to get the active learner started.

This creates an incentive for that model to be effective in a low-data regime. After all, if it requires millions of training examples, then it doesn’t really achieve the purpose of bootstrapping training data. The swans will have become chickens, with their concomitant eggs. Small models, such as SVMs, are popular choices here. 
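As a rough illustration of that bootstrap gate, here is a sketch of the readiness check and the kind of small seed model that gets handed the baton. The 25-per-class threshold and the linear SVM are assumptions for illustration, not prescriptions.

```python
from collections import Counter
from sklearn.svm import SVC

MIN_PER_CLASS = 25   # assumed bootstrap threshold, purely illustrative

def ready_to_bootstrap(y_labeled, all_classes, min_per_class=MIN_PER_CLASS):
    """True once every class, black swans included, has a minimum number of labels."""
    counts = Counter(y_labeled)
    return all(counts[c] >= min_per_class for c in all_classes)

def make_seed_learner():
    """A small, cheap model that can be refit after every labeling batch."""
    return SVC(kernel="linear", probability=True)
```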

Many active learning platforms don’t export the labeled data directly, choosing instead to export the active learning model itself as the final product. However, the need to overcome cold start implies small models optimized on sparse data, not the best models that technology can provide. When results matter, would you prefer an SVM or the BERT family driving your text classification?

The trouble with this is that the examples selected by the active learner don’t always transfer well to a different model. In one set of text classification experiments, combinations of active learners and “downstream” target models were tested – an SVM bootstrapping an LSTM, for example. Active learning outperformed a random sampling baseline only 37.5% of the time!
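The shape of such an experiment is easy to sketch: select a fixed labeling budget with a small acquisition model, then train the downstream model you actually care about on that subset and compare it against a random subset of the same size. A minimal, assumed setup (SVM acquirer, logistic regression as a stand-in for the downstream model, retrospective labels) might look like:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def select_actively(X_pool, y_pool, seed_idx, budget, batch=20):
    """Grow a labeled subset with an SVM-driven acquisition model.

    In a retrospective experiment the pool labels are already known,
    so the "oracle" is just a lookup into y_pool.
    """
    acq = SVC(kernel="linear", probability=True)
    chosen = list(seed_idx)                      # seed must cover at least two classes
    while len(chosen) < budget:
        acq.fit(X_pool[chosen], y_pool[chosen])
        conf = acq.predict_proba(X_pool).max(axis=1)
        conf[chosen] = np.inf                    # never re-pick already-labeled rows
        chosen.extend(np.argsort(conf)[:batch])  # least confident next
    return chosen[:budget]

def downstream_accuracy(X_pool, y_pool, idx, X_test, y_test):
    """Train the model you actually care about on the selected subset."""
    target = LogisticRegression(max_iter=1000)   # stand-in for the downstream model
    target.fit(X_pool[idx], y_pool[idx])
    return accuracy_score(y_test, target.predict(X_test))
```

Comparing `downstream_accuracy` on the actively selected indices versus a same-size random sample is the comparison behind figures like that 37.5%: one plausible reason transfer so often fails is that the chosen examples are tuned to the acquirer’s decision boundary, not necessarily to the downstream model’s.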

Q: But, but…Jaxon has active learning capability! Gotcha!
A: Allow me to rebut.

Active learning has its role in a modern TDP, but it can’t stand alone. Combined with other techniques, it offers a valuable tool to help leverage machine learning to power machine learning. Used as a building block, active learning can be folded into a broader platform to achieve labeling efficiency and carry that efficiency through to model-building, which is the real end goal. Labeling is simply a means to that end, after all.

Some of the approaches that Jaxon adds to basic active learning include:

  • Iterate not just models but also the human-computer interface (HCI).
    • Should the human labeler provide a label for every example, or should they be allowed to skip? How can they characterize their level of certainty for a provided label? Should they select the label from all available labels, or simply confirm or deny a proposed label? All of these choices carry implications along several dimensions: accuracy, speed, completeness, and fatigue, to name a few.
  • Add a broader label acquisition strategy.
    • This is implied by the cold start problem above: random labeling must occur until a necessary minimum is in place to bootstrap an active labeler. Beyond that, a user could be asked to edit an existing example or create a new one. Labeling heuristics can be written à la weak supervision. Generative models can be used to create synthetic examples. These strategies carry different mixes of the dimensions above (accuracy, speed, etc.), and the relative tradeoffs change with the specific learning context. This is an opportunity for a supervising “meta” learner to guide the training process. (The first sketch after this list shows what one such labeling heuristic might look like.)
  • Iterate models too!
    • Semi-supervised learning is useful. An active learner can be used to bootstrap a model—or ensemble of models—which can synthetically label a larger dataset, which in turn can be used to train another, better model (the second sketch after this list illustrates one such round). A key to such iteration is realizing that every model (including the humans who originally provided the bootstrapped labels) is noisy. There will be errors in the training data, and drift and variance in how model performance is measured. Quality filtering, calibration, and other techniques for robustness to labeling noise are keys to effective iterative training.
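To ground the label-acquisition bullet above, here is a toy weak-supervision sketch: a couple of hand-written labeling heuristics that vote (or abstain) on each example and are combined by simple majority. The label names and keyword rules are invented for illustration; real heuristics, and real label-model aggregation, would be task-specific.

```python
ABSTAIN = None

def lf_mentions_refund(text):
    """Heuristic: texts mentioning refunds get the (made-up) 'refund' label."""
    return "refund" if "refund" in text.lower() else ABSTAIN

def lf_exclamation_heavy(text):
    """Heuristic: lots of exclamation marks suggests the (made-up) 'complaint' label."""
    return "complaint" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_exclamation_heavy]

def weak_label(text):
    """Combine heuristic votes by simple majority; abstain if no heuristic fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)   # majority vote; ties are arbitrary
```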
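And to ground the model-iteration bullet: a minimal self-training round, assuming scikit-learn-style estimators. A bootstrapped seed model pseudo-labels a larger pool, a confidence threshold acts as a crude quality filter against labeling noise, and a second model is trained on the expanded set. The 0.9 threshold and the logistic-regression “student” are placeholders, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training_round(seed_model, X_labeled, y_labeled, X_unlabeled,
                        confidence_threshold=0.9):
    """Pseudo-label the pool, keep only confident labels, train a second model."""
    proba = seed_model.predict_proba(X_unlabeled)
    confidence = proba.max(axis=1)
    keep = confidence >= confidence_threshold                 # crude quality filter
    pseudo_y = seed_model.classes_[proba.argmax(axis=1)][keep]
    X_train = np.vstack([X_labeled, X_unlabeled[keep]])
    y_train = np.concatenate([y_labeled, pseudo_y])
    student = LogisticRegression(max_iter=1000)               # stand-in for a stronger model
    student.fit(X_train, y_train)
    return student
```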

Active learning can certainly be an important building block for an effective TDP and cost-sensitive (or maybe it’s cost-sane) machine learning. However, it is not sufficient on its own. Combining it with other techniques and, importantly, strategically executing the entire model training¹ lifecycle in the context of the task being learned is essential.

– Greg Harman, CTO

¹ Here I really do mean training. I promise this isn’t an attempt to upsell MLOps tools! There are a lot of companies that offer these. We aren’t one of them. Our lane is training, and we stick to our lane.