AI has a dirty little secret… it’s powered by humans! Armies of humans are annotating text to feed machine learning models (social posts, news, emails, call and chat logs, and more). That’s how machines learn (today, anyway): they train on millions, if not billions, of labeled examples related to the domain and the specific use cases. Deep learning models in particular have voracious appetites for constant streams of labeled data. An estimated 2.5 quintillion bytes of data are generated every day, carrying new information, different patterns, and shifting trends. Unfortunately for machines, that data is useless unless it’s labeled.
This labeling requirement is so huge that factories in China, India, and beyond have been converted to house the humans labeling data by hand. Companies pay tens of millions of dollars for these efforts, and it takes months to gather enough data to even start the training process. It is a wildly inefficient process, fraught with error, inconsistency, and bias. And a glaring fact remains: humans simply can’t keep up with the mountains of data pouring in every day!
In recent years, AI has resurged to become a major force for change. But in the world of machine learning, precision and accuracy are everything. Until recently, Natural Language Processing (NLP) applications were very limited in their ability to go beyond extracting general language concepts: known places, monuments, people, and universal representations of time and money. Learning the long-tail nuances of an organization’s own unique data is extremely challenging, and most systems fail at it. Language is ever evolving, and valuable insights are easily overlooked as new terminology emerges or familiar words appear in different contexts.
Take these examples of homographs (words spelled alike but differing in meaning):
The bandage was wound around the wound.
The farm was cultivated to produce produce.
The wind was too strong to wind the sail around the mast.
The point is that words can be spelled the same but mean very different things, depending on the words that surround them and the intended context.
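To make that concrete, here is a toy, Lesk-style disambiguator: it picks the sense whose hand-written clue words overlap most with the surrounding sentence. The sense inventory and clue sets are invented for illustration; this is not how any particular production system (Jaxon included) resolves word senses.

```python
def disambiguate(word, context, senses):
    """Pick the sense of `word` whose clue words overlap most with `context`."""
    context_words = set(context.lower().split())
    return max(senses[word], key=lambda s: len(senses[word][s] & context_words))

# Hand-built clue sets for two senses of "wound" (illustrative only).
SENSES = {
    "wound": {
        "injury":  {"bandage", "blood", "heal", "injury", "nurse"},
        "wrapped": {"around", "coil", "rope", "spool", "twisted"},
    }
}

print(disambiguate("wound", "the bandage was applied to heal the cut", SENSES))  # -> injury
print(disambiguate("wound", "the rope was wound around the spool", SENSES))      # -> wrapped
```

Real systems use richer context than bag-of-words overlap, but the principle is the same: the surrounding words are the evidence.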
Language is nuanced and complicated. There’s an infinite diversity of sentences people could write, which makes predictive models extremely challenging to train. In fact, getting machines to understand natural language is one of the hardest tasks in artificial intelligence today. Computers are accustomed to fixed rules, and when someone goes off script, they fail (cough, Siri). Language is also easily misinterpreted. Understanding what is actually intended, in both written and spoken text, is genuinely hard for a computer to do.
Text classification in general is a difficult, high-dimensional problem, and practical results have lagged behind those of image analysis. A new generation of deep learning neural networks has emerged with applicability to a wide swath of language analysis problems. However, as with all deep learning, these models require labeled training data, and lots of it. While these new models can produce “toy results” with smaller amounts of data, that leads to flaws such as overfitting and missing textual patterns not seen in the training data. A means of adding much larger quantities of labeled data is needed to take full advantage of the state of the art and advance it for practical commercial applications.
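As a caricature of what overfitting on a tiny labeled set looks like, consider a “classifier” that simply memorizes its training sentences: perfect on everything it has seen, useless on any paraphrase. The sentences and labels below are invented for illustration.

```python
# A deliberately overfit "model": it memorizes exact training sentences.
TRAIN = {
    "my package never arrived": "complaint",
    "thanks for the quick delivery": "praise",
}

def memorizing_classifier(sentence):
    # Exact-match lookup: zero generalization beyond the training data.
    return TRAIN.get(sentence.lower(), "unknown")

print(memorizing_classifier("My package never arrived"))   # -> complaint (seen in training)
print(memorizing_classifier("my parcel did not show up"))  # -> unknown (same meaning, unseen wording)
```

Deep models fail less crudely than a lookup table, but with too few labeled examples they drift toward the same behavior: memorizing training patterns instead of learning the language.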
What we are doing with Jaxon is starting to crack the questions “What does it mean when we use these words in sentences?” and “How are sentences related to one another?” and “Is this word close to this other word and why?” – context that is easy for humans, but nearly impossible for machines without the guidance of labels. Jaxon collects, analyzes, and then weighs the evidence to narrow the possibilities. Dozens of algorithms work together to come up with thousands of possible answers. Jaxon then ranks the possibilities by its confidence in those answers. All this happens almost instantaneously.
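The “weigh the evidence, then rank by confidence” idea can be sketched as follows: several simple scorers each rate candidate labels, and the combined scores order the answers. The scorers, labels, and weights here are invented, and far simpler than dozens of real algorithms working together.

```python
def keyword_scorer(text):
    # Crude evidence: keyword hits vote for a label.
    scores = {"sports": 0.0, "finance": 0.0}
    if "goal" in text:
        scores["sports"] += 1.0
    if "stock" in text:
        scores["finance"] += 1.0
    return scores

def prior_scorer(text):
    # A weak second signal, e.g. label frequency in past data (made up).
    return {"sports": 0.2, "finance": 0.1}

def rank_labels(text, scorers):
    # Sum each scorer's evidence, then rank candidates by total confidence.
    combined = {}
    for scorer in scorers:
        for label, score in scorer(text).items():
            combined[label] = combined.get(label, 0.0) + score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

print(rank_labels("the stock rallied after earnings", [keyword_scorer, prior_scorer]))
```

Here “finance” outranks “sports” because the keyword evidence outweighs the weak prior; with many scorers, the ranking becomes a genuine aggregation of independent signals.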
Data is fed into Jaxon through an elaborate assembly line that dissects, analyzes, and refines itself on the fly. Jaxon dives into the data, discovers correlations, looks for patterns and connections, and comes to conclusions, requiring only minimal guidance from a chaperone. To teach a computer to comprehend the meaning of natural language, we use building blocks to construct arguments and then find similarities between topics, concepts, and the sentences they appear in. The context of the surrounding text is usually the best clue to the intended meaning. Jaxon glues these clues together with probabilities and historical knowledge to determine the text’s classification and the best label(s) to apply.
With Jaxon, downstream supervised learning applications can train on massive numbers of labeled examples, reducing both false positives and false negatives. True vs. false and positive vs. negative is the classification game. Training sets with properly labeled data enable systems to learn from the examples, identify trends, and make decisions with minimal human intervention.
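For concreteness, precision and recall fall directly out of those true/false positive and negative counts. The counts below are invented for illustration.

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything we labeled positive, how much was right
    recall = tp / (tp + fn)     # of everything truly positive, how much we caught
    return precision, recall

# e.g. 80 correct positive labels, 20 false alarms, 40 misses (made-up counts)
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, r)  # -> 0.8 and roughly 0.667
```

Cutting false positives raises precision; cutting false negatives raises recall — which is why better labels improve both sides of the game.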
Jaxon constantly brings in more data and retrains these downstream consumers (a quality-control loop). The Jaxon Studio is a data scientist’s toolkit with ‘knobs to turn’ to increase precision and recall: adding more unlabeled and/or labeled examples, curating the model, ensembling, and so on. Quality metrics and automation help a data scientist discover and refine the right settings for each individual data set.
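The shape of such a quality-control retraining loop can be sketched as below. The `train`, `evaluate`, and `get_more_labels` functions are hypothetical placeholders standing in for whatever pipeline is being tuned — this is not Jaxon’s actual API.

```python
def quality_loop(train, evaluate, get_more_labels, target=0.9, max_rounds=5):
    """Keep adding labeled data and retraining until quality hits the target."""
    data, model = [], None
    for _ in range(max_rounds):
        data += get_more_labels()      # pull in a fresh batch of labeled examples
        model = train(data)            # retrain the downstream consumer
        if evaluate(model) >= target:  # good enough? stop turning knobs
            break
    return model
```

In practice the ‘knobs’ also cover which examples to pull, model curation, and ensembling choices; here the loop only grows the training set, but the retrain-evaluate-repeat rhythm is the same.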
Jaxon is an ideal companion for use cases where the language and its patterns are constantly changing. It’s an iterative process that never ends. Focusing on industries flooded with data, Jaxon is pushing NLP to a new level, pinpointing relevant words and phrases to drive more accurate decisions and predictions.