Distillation: Making Large Language Models Compute-Friendly

Large language models (LLMs) dominate the field of natural language processing and have been applied to machine translation, language modeling, question answering, and more. However, state-of-the-art models demand substantial computational resources, making them impractical for many real-world applications. ChatGPT, for example, reportedly costs OpenAI over $700,000 a day to run. Why? It just takes that much computational power for ChatGPT to figure out recipes for scallion focaccia (to use a personal example). 

This is where distillation comes in. It’s a technique for creating compute-friendly LLMs suitable for resource-constrained environments. Distilled models can power real-time language translation, automated speech recognition, and customer-service chatbots on edge devices like smartphones, tablets, and smartwatches. 

Distillation involves training a smaller model, known as a student model, to mimic the behavior of a larger model, known as a teacher model. The teacher model is usually a state-of-the-art LLM like GPT-4, which has a large number of parameters and requires significant computational resources; the student model is small enough to train and run far more cheaply.
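
To make the setup concrete, here is a minimal sketch using Hugging Face Transformers. GPT-4 itself is not openly available as a teacher, so open models stand in: "gpt2-large" plays the teacher and "distilgpt2" the student. Both model choices are illustrative assumptions, not a prescription.

```python
# Illustrative teacher/student setup for distillation (model names are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("gpt2-large")  # ~774M parameters, stands in for the large teacher
student = AutoModelForCausalLM.from_pretrained("distilgpt2")  # ~82M parameters, the compute-friendly student
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")       # both models share the GPT-2 vocabulary

# The teacher is frozen: only the student's weights are updated during distillation.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

print(f"teacher: {sum(p.numel() for p in teacher.parameters()) / 1e6:.0f}M params")
print(f"student: {sum(p.numel() for p in student.parameters()) / 1e6:.0f}M params")
```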

During the distillation process, the teacher model generates a set of soft targets: probability distributions over the possible next tokens in a sequence. The student model is then trained to match these distributions rather than only the teacher’s single most likely token, so it learns not just what the teacher predicts but how confident the teacher is in the alternatives. This allows the student model to learn from the teacher model’s knowledge without needing to replicate the same level of computational complexity. The result is a lighter LLM that can run in environments with limited computing resources while retaining most of the teacher’s performance.
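
The sketch below shows what that training objective typically looks like in PyTorch: the teacher’s logits are softened with a temperature, the student is pushed toward that distribution via KL divergence, and the result is blended with the usual hard-label loss. The temperature T, mixing weight alpha, and toy tensor shapes are illustrative assumptions, not values from the original post.

```python
# Minimal sketch of a soft-target distillation loss (hyperparameters are illustrative).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: temperature-scaled probability distribution over next tokens.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between student and teacher distributions; the T**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)
    # Standard cross-entropy against the ground-truth next tokens (the hard labels).
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Toy usage: a batch of 4 token positions over a 50k-token vocabulary.
vocab = 50000
student_logits = torch.randn(4, vocab, requires_grad=True)
teacher_logits = torch.randn(4, vocab)
labels = torch.randint(0, vocab, (4,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```

In practice, only the KL term depends on the teacher; the cross-entropy term keeps the student anchored to the ground-truth data, and alpha controls how much the student leans on each signal.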


Compute-friendly LLMs also reduce the overall energy consumption associated with running these models. Models like GPT-4 require significant amounts of energy to train and serve, which contributes to carbon emissions. By creating more efficient LLMs through distillation, we can reduce the environmental impact of AI and natural language processing and make the field more sustainable for the future.

In conclusion, distillation is a powerful technique for creating compute-friendly LLMs that can be used in resource-constrained environments. By training smaller models to mimic the behavior of larger ones, we can build models that require far fewer computational resources while giving up little in performance. This approach not only makes NLP more accessible in resource-limited contexts but also reduces the environmental impact associated with large-scale language modeling.