Building machine learning models can be exhilarating – finding that optimal combination of technologies and piecing them together into a smooth final result gives a unique sense of accomplishment that only data scientists and engineers really understand. The process of getting to the final model, though, can often be tedious. Inexperienced data science teams might encounter what are referred to as the “gotchas” of machine learning/natural language processing (ML/NLP) – obstacles they don’t know how to handle, or may not even know exist. Each one presents its own challenge, and there are far too many to include in this short writeup, but here are a few common occurrences:
State-of-the-art ML/NLP Models are Brittle and Spurious
Beware of undertraining and overtraining
Models can fail when a small amount of text is modified, even though its meaning is preserved. ML models often memorize artifacts and biases instead of truly learning. This is overfitting at its worst. The whole point is to train a model that generalizes to novel (previously unseen) data. Anything other than generalization performance is window dressing.
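The difference between memorizing and generalizing can be shown with a deliberately simplified sketch (not a real NLP model – the data and functions here are hypothetical, and the underlying rule is just y = 2x):

```python
# Training data that follows a simple underlying rule: y = 2x.
train = {1: 2, 2: 4, 3: 6, 4: 8}

def memorizer(x):
    # Memorizes the training artifacts; has no answer for novel inputs.
    return train.get(x)

def generalizer(x):
    # Captures the underlying relationship, so it handles unseen data.
    return 2 * x

# Both look perfect on the training set...
assert all(memorizer(x) == y for x, y in train.items())
assert all(generalizer(x) == y for x, y in train.items())

# ...but only the generalizer works on novel input.
print(memorizer(10))    # None -- memorization fails off the training set
print(generalizer(10))  # 20
```

Both models score 100% on the data they were trained on; only held-out data exposes the difference – which is why evaluation on unseen data is the only number that matters.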
Disregarding the “Data Science Method”
Starting without a plan
Data science is a structured process that begins with well-defined objectives and questions, followed by a few hypotheses that address those objectives. Data scientists often tend to jump into the data without thinking about the questions they need to answer through analysis. Every data science project needs a clear project goal and a clear model goal. Data scientists who do not know what they want end up with analysis results that they do not want. Don’t put the cart before the horse.
It’s all about the input, stupid
The data scientist (or data engineer) will spend at least 80% of their effort on data preparation: collecting, visualizing, cleaning, and imputing the data. This work culminates in the final Feature Engineering task, where the n-dimensional feature vectors are defined. It’s all about input quality: garbage in, garbage out. Give an ML/NLP model good input, and its task (prediction) is much easier. Once quality input is available for training, we then decide how much of it we need to ensure the model can generalize without overfitting. This depends on the type and complexity of the ML/NLP model used. Novice data scientists do not give this stage enough attention.
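The imputation and feature-engineering steps above can be sketched in a few lines. This is a minimal illustration with hypothetical field names and made-up records, not a production pipeline:

```python
# Hypothetical raw records with missing values.
raw_rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},  # missing age to impute
    {"age": 45, "income": None},     # missing income to impute
]

def impute_mean(rows, field):
    """Replace missing values in `field` with the column mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [
        {**r, field: r[field] if r[field] is not None else mean}
        for r in rows
    ]

clean = impute_mean(impute_mean(raw_rows, "age"), "income")

# Feature engineering: turn each cleaned record into a feature vector.
features = [(r["age"], r["income"]) for r in clean]
print(features)  # [(34, 52000), (39.5, 61000), (45, 56500.0)]
```

Even this toy example shows why the stage dominates the effort: every imputation choice (mean vs. median vs. model-based) changes the feature vectors the model ultimately sees.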
Focusing Only on the Data
Too much of a good thing
Novice data scientists forget that if you crunch data long enough, it will say anything you want it to say. If you have a very large data collection, you’re going to find correlations. And people tend to conflate correlation with causation, forgetting that correlation does not imply causation.
Ignore Probabilities at Your Own Peril
Why we should all be Bayesians
Novice data scientists tend to ignore the range of possible solutions, which leads to wrong decisions more often than not. There is no single right answer to a specific problem – hence, informed choices have to be made from various possibilities. These possibilities live in the solution design space. Probability theory and Bayesian statistics are two essential tools that allow the data scientist and engineer to explore this design space and find a solution that is good enough, if not optimal.
A probabilistic approach ensures that decisions are correct more often. The Bayes approach is an average-case analysis that considers the average risk of making a wrong decision when determining an estimator over the design space. This includes the (hyper)parameters to be fit using the training data. The Bayes risk is the minimum risk that can be achieved – this is theoretically proven. And this approach applies to both human and machine endeavors: to artificial intelligence and to the machine learning solutions we build. Be a Bayesian.
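At its core, the Bayesian update is a one-line formula. Here is a minimal sketch with hypothetical numbers (a prior belief that some model change helps, revised after seeing favorable evidence):

```python
# Hypothetical inputs -- the specific probabilities are made up for illustration.
p_h = 0.3              # prior: P(hypothesis is true)
p_e_given_h = 0.9      # likelihood: P(evidence | hypothesis)
p_e_given_not_h = 0.2  # P(evidence | hypothesis false)

# Law of total probability: overall chance of seeing the evidence.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = p_e_given_h * p_h / p_e
print(round(posterior, 3))  # 0.659
```

The evidence moves the belief from 30% to roughly 66% – the posterior, not the raw intuition, is what should drive the next decision over the design space.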
– Dr. Robert Jones, Principal Data Scientist