Long story short, humans are slow.
When humans are labeling text – for example reading a tweet and deciding if it demonstrates positive or negative sentiment – the whole process takes at least 10 seconds. However, a machine can make the same judgement in milliseconds. Machines are simply several orders of magnitude faster, which is a big reason why they’re being used more and more for tasks that used to be reserved for humans.
On top of that, humans are also wildly inconsistent and easily influenced.
Consider studies of eyewitness accounts that demonstrate that memory is not to be trusted – 71% of convictions that were overturned when DNA evidence began to be used in courts had been based on eyewitness testimony. Psychologists have shown that partial recall of false memories can be easily induced in almost half of study participants using only the power of suggestion. Psychologists have also been demonstrating for years how easy it is to influence participant responses based on sentences, phrases, or words they have previously read, conversations they have had with the experimenters or other participants prior to the experiment, or unintentional nonverbal communication by the person administering the experiment or test. Humans, all of us, are just not impartial judges.
This susceptibility to influence is another reason why human labelers are not a good solution for labeling examples. In linguistics, the view is that if a native speaker of a language makes a judgement about a sentence, that judgement is considered ground truth, a gold standard. But these ‘truths’ don’t always match up across people or contexts. Consider whether each of these sentences is grammatically correct:
- After all this dry weather, the car needs washed.
- After all this dry weather, the car needs to be washed.
Native English speakers should judge the second sentence as correct – that one is easy. Many people from around the US would judge the first as incorrect, hands down – it is missing the copula ‘to be’. But if you’re from Pittsburgh, the first sentence will sound completely natural and grammatically correct. So if you have two people from different regions judge or label these sentences, you may get opposite results. How do you get your ground truth now?
Things get even hairier when people are making judgements about something such as sentiment. Consider whether these sentences indicate positive or negative sentiment:
- These Nike kicks I just got are bad.
- I’m quite chuffed with my new Adidas trainers.
- Just bought these new shoes. They’re fine.
Depending on factors like what country you’re from, your race, your age, or your socioeconomic status, each of these examples could be judged to be positive or negative – and both answers would be correct. Without more context or knowing what the writer intended, there’s no knowing which sentiment is actually right. As it turns out, ‘truth’ really is subjective.
To combat this, when companies have humans label training data for AI applications, they generally solicit at least three judgements per label. If the labeling is inconsistent or the task is unclear or ambiguous, more humans will be needed for every label produced to train these applications. These labeling tasks are not set up like experiments with controls and different versions – there’s nothing preventing labelers from being influenced by the labels they have already assigned, their state of mind, or where they happen to be at the time (most of the time the labeling is being done by people on their personal computers).
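To make the "at least three judgements per label" idea concrete, here is a minimal sketch of one common way to combine them: a simple majority vote that refuses to pick a label when the labelers are tied or agree too weakly. The function name `aggregate_label` and the two-thirds agreement threshold are assumptions made for illustration, not a description of how any particular labeling vendor works.

```python
from collections import Counter


def aggregate_label(judgements, min_agreement=2 / 3):
    """Combine several human judgements for one example by majority vote.

    Returns (label, agreement). The label is None when the top two labels
    are tied or when agreement falls below min_agreement, which signals
    that the example needs more labelers. The threshold is an assumption.
    """
    if not judgements:
        raise ValueError("need at least one judgement")

    counts = Counter(judgements)
    top_two = counts.most_common(2)
    label, votes = top_two[0]
    agreement = votes / len(judgements)

    # A tie means there is no majority at all.
    tied = len(top_two) > 1 and top_two[1][1] == votes
    if tied or agreement < min_agreement:
        return None, agreement
    return label, agreement


if __name__ == "__main__":
    # Hypothetical judgements for "Just bought these new shoes. They're fine."
    print(aggregate_label(["positive", "negative", "positive"]))
    # Three labelers, three different readings: no usable label.
    print(aggregate_label(["positive", "negative", "neutral"]))
```

Note that a vote like this only settles which label wins. It does nothing about labelers who share the same regional or generational reading of a sentence, so a confident majority can still be wrong about what the writer meant.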
Now consider that AI applications need anywhere from several hundred thousand to millions (and sometimes even billions!) of examples for adequate training data, and that each human judgement will cost upwards of 10 cents depending on the examples being judged. That price is, of course, assuming you don’t need an expert to make the judgements, which will cost a LOT more – several orders of magnitude more. That’s money wasted doing a single task several times over to make sure it’s adequate.
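A rough back-of-the-envelope calculation shows how quickly this adds up. The sketch below uses only the minimums mentioned in this post (10 seconds and 10 cents per judgement, three judgements per example); the function name `labeling_estimate` and the dataset sizes are assumptions chosen for illustration, and real prices and timings will vary.

```python
# Lower-bound figures taken from the text; real numbers will be higher.
SECONDS_PER_JUDGEMENT = 10
CENTS_PER_JUDGEMENT = 10
JUDGEMENTS_PER_EXAMPLE = 3


def labeling_estimate(
    num_examples,
    judgements_per_example=JUDGEMENTS_PER_EXAMPLE,
    cents_per_judgement=CENTS_PER_JUDGEMENT,
    seconds_per_judgement=SECONDS_PER_JUDGEMENT,
):
    """Return (total_judgements, cost_in_dollars, labeler_hours)."""
    total_judgements = num_examples * judgements_per_example
    cost_dollars = total_judgements * cents_per_judgement / 100
    labeler_hours = total_judgements * seconds_per_judgement / 3600
    return total_judgements, cost_dollars, labeler_hours


if __name__ == "__main__":
    # Illustrative dataset sizes, not figures from any real project.
    for n in (500_000, 1_000_000):
        judgements, dollars, hours = labeling_estimate(n)
        print(f"{n:,} examples: {judgements:,} judgements, "
              f"${dollars:,.0f}, {hours:,.0f} labeler-hours")
```

Under these assumed minimums, 500,000 examples already mean 1.5 million judgements, about $150,000, and more than 4,000 hours of labeler time, all before any expert labelers or extra judgements for ambiguous examples are involved.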
Then, once businesses have gone through this process, which takes several weeks or even months, they tend to assume that they’ve now got ground truth labels to train their AI applications with. The labels are 100% accurate, or should be, right? But this is not the case. A study at Stanford found that human-labeled data is only about 87% accurate – not exactly the ‘ground truth’ businesses are paying for.
Worse, because the labeled data has taken so long to arrive, there’s a very real possibility that it is already outdated, or will be very soon. Language changes constantly, and companies and industries frequently shift focus and direction. What happens when a chatbot suddenly needs to learn how to use and respond to brand new marketing or domain-specific terms? Or when teens and young adults find another new way to express good or bad (anyone heard ‘poggers’? ‘zoomer’?) and a sentiment analysis engine needs to keep up with the changing terminology? Do companies go through the entire, expensive weeks- to months-long labeling process all over again, or do they simply settle for AI that is out of date and not performing as well as it should be?
These limitations are just some of the reasons why using Jaxon to train other AI applications simply makes sense. The whole process of generating large training sets takes minutes to hours and lets models be updated and changed as frequently as necessary, so companies don’t have to settle for ‘good enough’ – they can have the most up-to-date models all the time, no matter how fast language, marketing, or focus changes. As an AI itself, Jaxon is an impartial labeler that is not susceptible to outside influence. And best of all, Jaxon understands your company’s domain-specific terminology, regardless of how unusual or nuanced, because it learns from your data.
– Charlotte Ruth, Director of Linguistics