Fusion of knowledge in large language models (LLMs) is an emerging technique in AI. This isn't about building new AI models from scratch, which can be expensive and time-consuming. Instead, it's an innovative, efficient approach: merging the capabilities of existing, well-trained language models to forge a more powerful AI tool.
Picture this: we blend the unique strengths of renowned models like Llama-2, MPT, and OpenLLaMA. It's somewhat like assembling a team of experts, each excelling in their field, to work together. The result is an AI that doesn't just master one area but is remarkably adept across various tasks.
Existing approaches
The idea of developing custom LLMs has become increasingly popular. However, the high costs and environmental impact of training these models from scratch present significant challenges. Existing alternatives, such as model ensembles and weight merging, handle many tasks well but fall short on others.
Model ensemble uses multiple models concurrently, each contributing to the final output. This method doesn't merge models but aggregates their outputs for improved results. Techniques like weighted average and majority voting are typical examples. They pool predictions from different models, aiming to leverage their collective strengths.
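To make this concrete, here is a minimal sketch of output-level ensembling. The class probabilities, labels, and weights below are illustrative placeholders, not values from the paper.

```python
# Weighted averaging and majority voting over model outputs.
from collections import Counter

def weighted_average(prob_dists, weights):
    """Combine per-class probabilities from several models."""
    total = sum(weights)
    num_classes = len(prob_dists[0])
    return [
        sum(w * dist[c] for dist, w in zip(prob_dists, weights)) / total
        for c in range(num_classes)
    ]

def majority_vote(labels):
    """Pick the label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

# Three models scoring two classes; weights reflect trust in each model.
dists = [[0.7, 0.3], [0.4, 0.6], [0.6, 0.4]]
print(weighted_average(dists, weights=[0.5, 0.3, 0.2]))  # fused distribution
print(majority_vote(["A", "B", "A"]))                    # -> "A"
```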
Weight merging, on the other hand, combines the actual parameters of different models. This technique fuses weights from models with identical structures but trained under different conditions. This approach aims to create a more robust and versatile model capable of handling a wider range of tasks or domains. It's a more profound integration than model ensemble, operating at the parameter level rather than output.
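A minimal sketch of this idea, assuming PyTorch-style state dicts and two checkpoints with identical architectures; the toy tensors below stand in for real model parameters.

```python
import torch

def merge_state_dicts(state_a, state_b, alpha=0.5):
    """Linearly interpolate two sets of parameters, key by key."""
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k]
            for k in state_a}

# Toy example with two tiny 'models' sharing the same parameter shapes.
state_a = {"linear.weight": torch.ones(2, 2)}
state_b = {"linear.weight": torch.zeros(2, 2)}
merged = merge_state_dicts(state_a, state_b, alpha=0.5)
print(merged["linear.weight"])  # 0.5 everywhere
```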
Despite their effectiveness, both methods have limitations. Model ensembling relies heavily on the individual strengths of each model without creating a new, unified system. Weight merging, while more deeply integrated, requires models with identical architectures, which restricts its applicability.
Knowledge distillation
Knowledge distillation trains a smaller “student” model using one or more larger “teacher” models. This technique is popular in natural language processing, for instance for improving text classifiers. The student learns to mimic the teachers' outputs, including their probability distributions and internal features. For text generation tasks, the focus shifts to reducing the divergence between how the student and teacher generate text, often by guiding the student with the teacher's output distributions.
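As a sketch of the classic recipe, the loss below trains a student to match a teacher's temperature-softened distribution. The logits are random placeholders standing in for real model outputs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened output distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

student_logits = torch.randn(4, 32000)  # batch of 4, vocab of 32k
teacher_logits = torch.randn(4, 32000)
print(distillation_loss(student_logits, teacher_logits))
```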
Knowledge fusion resembles this setup but diverges in two significant ways. Unlike traditional knowledge distillation, where the student is smaller than its teachers, no size restriction is placed on the target model. More importantly, the target model is expected to outperform its teachers after training, a notable departure from the usual outcome of distillation.
Fusion of LLMs
The paper “Knowledge Fusion of Large Language Models” introduces FuseLLM, a novel method built on the probability distributions of source LLMs. The technique aligns and fuses these distributions during training, aiming to minimize the divergence between the target model's distributions and those of the source models. The effectiveness of FuseLLM was demonstrated using three distinct LLMs - Llama-2, OpenLLaMA, and MPT - across various tasks. The fused model showed superior performance, indicating the potential of this approach in efficiently combining the strengths of different LLMs.
In the FuseLLM method, the knowledge of different language models is integrated through two critical components: token alignment and fusion strategies.
Token Alignment: The key to successful knowledge fusion across multiple LLMs is accurately aligning their tokens. Tokens are the basic processing units in LLMs, and aligning them ensures that the models interpret and respond to inputs consistently. If two tokens from different models match exactly, their probability distributions can be aligned directly. Where there is no direct match, however, the aligned distribution degenerates into a simple one-hot vector.
The researchers refine this approach to account for the fact that different tokenizers often produce slightly different tokens for the same text. Instead of insisting on an exact match, they introduce a minimum edit distance (MinED) strategy, which is more flexible and allows close but not identical tokens to be treated as aligned. This preserves most of the information in the distribution matrices while introducing only minimal error.
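A minimal sketch of the idea follows, with a toy vocabulary standing in for a real tokenizer; the paper's actual alignment rules are more involved.

```python
# MinED-style token alignment: when two tokenizers disagree, map each
# source token to the target token with the smallest edit distance.
def edit_distance(a, b):
    """Classic Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def align_token(source_token, target_vocab):
    """Pick the target token closest to the source token under MinED."""
    return min(target_vocab, key=lambda t: edit_distance(source_token, t))

# '_knowledge' vs 'knowledge': no exact match, but MinED still aligns them.
print(align_token("_knowledge", ["know", "knowledge", "ledge"]))  # 'knowledge'
```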
Fusion Strategies: In fusing the source LLMs, the strategy is to blend their knowledge while preserving their individual strengths. To do this effectively, each LLM's quality is assessed using cross-entropy loss, which compares its predictions with the ideal, or 'gold', labels. This measure indicates how well each LLM comprehends the text.
Based on these cross-entropy scores, the importance of each LLM's predictions is then determined. Lower scores indicate a more accurate understanding, so those models' outputs receive more weight. With this approach, the researchers have developed two key fusion functions, sketched in code after the list:
1. MinCE: This function selects the model with the lowest cross-entropy score, prioritizing the most accurate prediction for each case.
2. AvgCE: Instead of picking just one, this function calculates a weighted average of all models' predictions, with the weights determined by their cross-entropy scores. These methods ensure a balanced and effective combination of the strengths of different LLMs.
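Here is a minimal sketch of the two functions, assuming each source model's token distributions have already been aligned to a shared vocabulary. The softmax weighting in AvgCE is one plausible choice for turning scores into weights, not necessarily the paper's exact formula, and the tensors are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def fuse(source_log_probs, gold_ids, strategy="MinCE"):
    """Fuse per-model next-token distributions based on cross-entropy.

    source_log_probs: (num_models, seq_len, vocab) aligned log-probabilities
    gold_ids:         (seq_len,) gold token ids used to score each model
    """
    # Per-model cross-entropy against the gold labels: lower is better.
    ce = torch.stack([F.nll_loss(lp, gold_ids) for lp in source_log_probs])
    if strategy == "MinCE":
        # Keep only the distribution of the best-scoring model.
        return source_log_probs[ce.argmin()]
    # AvgCE: weighted average, with more weight on lower-CE models.
    weights = F.softmax(-ce, dim=0)                 # (num_models,)
    probs = source_log_probs.exp()                  # back to probabilities
    fused = (weights[:, None, None] * probs).sum(dim=0)
    return fused.log()

log_probs = F.log_softmax(torch.randn(3, 8, 100), dim=-1)  # 3 source models
gold = torch.randint(0, 100, (8,))
print(fuse(log_probs, gold, strategy="AvgCE").shape)  # torch.Size([8, 100])
```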
Results
Table 1 shows how FuseLLM stacks up against the baselines on the Big-Bench Hard (BBH) benchmark. The three source LLMs perform at different levels across the 27 BBH tasks, with Llama-2 usually leading the pack. After continual training on a compact but diverse corpus, Llama-2's performance improves by 1.86% on average, though the gain varies across tasks. FuseLLM, in contrast, achieves an average gain of 5.16% over the original Llama-2 across all tasks. On some tasks the improvement is remarkable: on Hyperbaton, for example, the score jumps from 54.40 to 65.20. On tasks where simple continual training hurts performance, FuseLLM draws on the combined strengths of its source LLMs to recover it. There are still cases, such as the geometric shapes and word sorting tasks, where FuseLLM's performance drops. This could be due to two factors: the source LLMs other than Llama-2 performing poorly on these tasks, and a mismatch between the training dataset and the tasks.
Table 2 shows that FuseLLM surpasses the baselines on every task of the CommonSense (CS) benchmark, and Table 3 shows it achieving a better pass rate on 9 of the 10 code-generation tasks.
Wrapping up
By innovatively merging existing models, FuseLLM overcomes the limitations of traditional methods, creating a more versatile and powerful language model. With its unique approach to token alignment and fusion strategies, it demonstrates remarkable performance gains in benchmarks, showcasing the potential of this method in enhancing LLM capabilities efficiently.
Disclaimer: This post is informed by the scholarly article “Knowledge Fusion of Large Language Models” (Wan et al., 2024).