Fine-tuning is a key part of improving large language models (LLMs). A year ago, ByteDance researchers developed a new approach called reinforced fine-tuning (ReFT), which was set to change how we train these models. Just a month ago, as part of the “12 Days of OpenAI” event, Sam Altman introduced OpenAI’s own version, reinforcement fine-tuning.
Reinforcement fine-tuning, as the name suggests, combines reinforcement learning, where a model learns to make decisions, with the standard fine-tuning process.
Reinforcement fine-tuning: OpenAI approach
On the second day of the “12 Days of OpenAI” event, OpenAI announced a new technique for fine-tuning its models, called reinforcement fine-tuning (ReFT).
What is reinforcement fine-tuning?
Reinforcement fine-tuning is a way to improve large language models by training them with a reward-based process. These “frontier models” are already capable of many tasks—like translation, assistance, and coding—but there’s ongoing research into how to fine-tune them efficiently. The aim is to adapt them to specific styles, tones, or narrow domains (e.g., offering medical advice or handling specialized classifications) without using too much computing power or massive labeled datasets.
OpenAI has shown that ReFT can deliver solid fine-tuning results with only a handful of training examples. This efficiency is crucial in fields where data is limited and expensive, such as the medical sector.
ReFT itself comes from reinforcement learning (RL). In RL, an agent learns by receiving positive or negative rewards for its actions. Here, the “actions” are the model’s outputs, which get scored to show how well they meet our expectations. Over multiple rounds of fine-tuning, the model updates its parameters to aim for higher scores.
Reinforcement fine-tuning stages
To make ReFT work, you start with a labeled dataset (split into training and validation sets).
Unlike typical fine-tuning, where the model simply tries to match the labeled answers, ReFT encourages the model to reason its way to those answers. A system of “graders” scores each output the model produces (for example, from 0 to 1). That score acts as a reward signal, nudging the model’s parameters in a direction that should yield better results. As this process repeats, the model’s performance is checked on the validation set to confirm it is learning effectively rather than just memorizing.
So, a basic reinforcement fine-tuning workflow (sketched in code after this list) would involve:
- Preparing a labeled dataset (split into training and validation).
- Using “graders” to assign scores that guide the model’s learning process.
- Repeating training rounds and checking performance on the validation set.
- Watching for genuine improvements rather than rote memorization.
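To make this loop concrete, here is a minimal, framework-agnostic sketch in Python. It is not OpenAI’s API: `toy_grader`, `generate_fn`, and `update_fn` are illustrative stand-ins for a task-specific grader, the model’s sampling step, and its parameter update.

```python
# Minimal sketch of a reward-based fine-tuning loop (illustrative, not OpenAI's API).
from typing import Callable, List, Tuple


def toy_grader(model_answer: str, reference: str) -> float:
    """Return a score in [0, 1]: 1.0 for an exact match, partial credit for token overlap."""
    if model_answer.strip() == reference.strip():
        return 1.0
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(model_answer.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & ans_tokens) / len(ref_tokens)


def reinforcement_finetune(
    train_set: List[Tuple[str, str]],                  # (prompt, reference answer) pairs
    val_set: List[Tuple[str, str]],
    generate_fn: Callable[[str], str],                 # samples an output from the current model
    update_fn: Callable[[str, str, float], None],      # nudges parameters using the reward
    epochs: int = 3,
) -> None:
    for epoch in range(epochs):
        for prompt, reference in train_set:
            output = generate_fn(prompt)
            reward = toy_grader(output, reference)     # the grade acts as the reward signal
            update_fn(prompt, output, reward)

        # Check generalization on held-out data rather than memorization of the training set.
        val_scores = [toy_grader(generate_fn(p), r) for p, r in val_set]
        print(f"epoch {epoch}: mean validation score = {sum(val_scores) / len(val_scores):.3f}")
```

In OpenAI’s hosted setup, grading and the parameter updates happen on their side; the sketch only illustrates the shape of the feedback loop described above.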
Although the core concept is straightforward, the practical details can vary. Still, results so far look promising. Even with as few as 1,100 examples, ReFT pushed a smaller “o1-mini” model to surpass the larger standard “o1” model, highlighting how targeted training plus reward signals can lead to impressive gains.
Reinforced fine-tuning (ReFT): ByteDance approach
Current approaches to solving math problems revolve around supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, which often results in models that don’t generalize well. One reason is that the training data typically provides a single CoT annotation per question, even though the same question often admits several valid CoT reasoning paths.
To address this issue and offer a more versatile way to solve advanced reasoning problems, ByteDance scientists came up with reinforced fine-tuning (ReFT) in January 2024.
Reinforced fine-tuning (ReFT) starts with supervised fine-tuning (SFT), typically lasting one or two cycles. During this phase, the model gains the essential ability to solve mathematical problems correctly. ReFT then takes training a step further with reinforcement learning (RL), using proximal policy optimization (PPO). This stage lets the model explore and learn from a variety of correct solutions and reasoning paths.
What makes ReFT efficient in this context is its use of the existing training data, which already includes the correct answers. These answers form the basis for the rewards in the PPO training process, eliminating the need for an additional, separately trained reward system. This is a vital difference from other methods like RLHF, which rely on rewards determined from human-annotated data.
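As a sketch of this idea (not ByteDance’s actual code), the reward can be computed directly from the labeled final answer. `extract_final_answer` is a hypothetical helper that pulls the last number out of a generated chain of thought.

```python
# Sketch of an answer-matching reward: the existing labels double as the reward
# signal, so no separately trained reward model is needed.
import re
from typing import Optional


def extract_final_answer(cot: str) -> Optional[str]:
    """Hypothetical extractor: take the last number appearing in the chain of thought."""
    numbers = re.findall(r"-?\d+\.?\d*", cot)
    return numbers[-1] if numbers else None


def answer_matching_reward(generated_cot: str, ground_truth: str) -> float:
    predicted = extract_final_answer(generated_cot)
    if predicted is None:
        return 0.0                       # no extractable answer
    return 1.0 if predicted == ground_truth.strip() else 0.0
```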
ReFT stages
The reinforced fine-tuning (ReFT) process is divided into two main stages: warm-up and reinforcement learning. Let’s look at each in more detail.
Warm-up stage
In this initial stage, the model undergoes fine-tuning for a few cycles on a dataset of question and chain-of-thought (CoT) pairs. This stage is crucial for imparting basic problem-solving skills, enabling the model to generate appropriate responses to questions. The process involves predicting the sequence of tokens that forms the CoT, ending with a special token that marks the end of the sequence. The model learns to generate a CoT by sampling actions from a policy and updating its state with each action taken. By the end of the warm-up, the model reaches a foundational level of accuracy.
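A minimal warm-up sketch is shown below, assuming the Hugging Face transformers library and a causal language model; the checkpoint name and the one-example dataset are purely illustrative. Each example is a question concatenated with its CoT, trained with ordinary next-token prediction and terminated by the EOS token.

```python
# Warm-up (SFT) sketch: next-token prediction on question + CoT pairs.
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"            # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

# Tiny illustrative dataset of (question, chain-of-thought) pairs.
warmup_data = [("Question: What is 2 + 3 * 4?",
                "3 * 4 = 12, then 2 + 12 = 14. The answer is 14.")]

model.train()
for epoch in range(2):                              # "one or two cycles" of warm-up
    for question, cot in warmup_data:
        text = question + "\n" + cot + tokenizer.eos_token
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])   # cross-entropy over the sequence
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```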
Reinforcement learning stage
After the warm-up, the model enters the reinforcement learning stage, where it enhances its performance through online self-learning. This stage uses a dataset of question-and-answer pairs. The model learns by repeatedly sampling responses, assessing their correctness, and updating its parameters accordingly. Training uses proximal policy optimization (PPO), and rewards are based on whether the final answer derived from the model’s CoT matches the ground-truth answer.
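The stage can be summarized in a few lines, again as a sketch rather than the paper’s implementation. `sample_cot` and `ppo_step` are hypothetical hooks standing in for the policy’s sampler and a PPO trainer update (a library such as TRL could provide the latter), and `reward_fn` could be the answer-matching reward sketched earlier.

```python
# Sketch of the RL stage: sample CoTs, reward them by final-answer correctness,
# and pass the rollout to a PPO update.
from typing import Callable, List, Tuple


def rl_stage(
    qa_pairs: List[Tuple[str, str]],                       # (question, ground-truth answer)
    sample_cot: Callable[[str], str],                      # draws a CoT from the current policy
    reward_fn: Callable[[str, str], float],                # e.g., an answer-matching reward
    ppo_step: Callable[[List[Tuple[str, str, float]]], None],
    samples_per_question: int = 4,
    iterations: int = 100,
) -> None:
    for _ in range(iterations):
        rollout = []
        for question, answer in qa_pairs:
            for _ in range(samples_per_question):
                cot = sample_cot(question)
                rollout.append((question, cot, reward_fn(cot, answer)))
        ppo_step(rollout)                                  # PPO update on the sampled batch
```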
Because ReFT learns from diverse CoT reasoning strategies, it gains a more comprehensive learning experience than SFT alone. Consequently, ReFT generalizes better at math problem-solving while using the same training questions as SFT, without needing additional or modified training material. Moreover, ReFT’s methodology is compatible with standard data engineering techniques, allowing it to integrate smoothly into existing training frameworks.
Methods
In their study, the authors compare reinforced fine-tuning (ReFT) with supervised fine-tuning (SFT) and two self-training methodologies: offline self-training (Offline-ST) and online self-training (Online-ST). SFT simply fine-tunes the language model on the training data, while the self-training approaches reuse model-generated samples for further training.
Offline-ST: This method uses an early SFT checkpoint to generate chain-of-thought (CoT) samples and retains only those that match the ground truth. These are then combined with the original training data for further fine-tuning.
Online-ST: This method is designed to be similar to ReFT and also starts with a warm-up phase. It continuously trains the model on newly generated CoTs, selecting only those with correct answers for further model updates.
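Both self-training baselines share the same core step, sketched below with hypothetical `generate_cots` and `extract_final_answer` helpers: sample CoTs from the model and keep only those whose final answer matches the ground truth.

```python
# Sketch of the self-training filter: keep only model-generated CoTs whose final
# answer matches the ground truth, then reuse them as extra supervised data.
from typing import Callable, List, Optional, Tuple


def filter_correct_cots(
    qa_pairs: List[Tuple[str, str]],                           # (question, ground-truth answer)
    generate_cots: Callable[[str, int], List[str]],            # samples CoTs for a question
    extract_final_answer: Callable[[str], Optional[str]],      # pulls the final answer from a CoT
    samples_per_question: int = 8,
) -> List[Tuple[str, str]]:
    kept = []
    for question, answer in qa_pairs:
        for cot in generate_cots(question, samples_per_question):
            if extract_final_answer(cot) == answer:
                kept.append((question, cot))                   # only correct CoTs are reused
    return kept
```

The offline variant runs this filter once with the early SFT model and fine-tunes on the augmented data; the online variant repeats it continuously as training progresses.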
The experimental setup involved two foundation models commonly used for math problem-solving research, Galactica-6.7B and CodeLLAMA-7B. Techniques like majority voting and reward model reranking were applied to enhance the results further. The training utilized substantial computational resources, with specific settings for the warm-up stage, the number of training epochs, and the learning rate. The ReFT approach was tested against these baselines to validate its effectiveness.
Results
In the comparative studies shown in Table 1, reinforced fine-tuning (ReFT) consistently outperforms supervised fine-tuning (SFT) and self-training methods across various datasets, including GSM8K, SVAMP, and MathQA. ReFT’s gains over SFT are especially notable on CodeLLAMA’s GSM8K evaluations in both natural-language CoT (N-CoT) and program-based CoT (P-CoT) settings, where it achieved improvements of over 9 and 8 points, respectively. Averaged over all datasets with CodeLLAMA, ReFT improved on SFT by 3.7 points in N-CoT and 5.9 points in P-CoT settings.
Notably, these results were achieved without additional annotations or specialized reward models, underscoring ReFT’s robust generalization capabilities and demonstrating the potential of using reinforcement learning to explore the training data more effectively.
The comparison also reveals that while offline self-training can sometimes enhance performance over SFT, the improvements are not as pronounced as those achieved by ReFT. This indicates that the exploratory nature of ReFT is crucial for its success. Although online self-training showed some gains with the Galactica model, it still lagged behind ReFT overall. This suggests that incorporating incorrect instances is vital for guiding the model towards more effective exploration. Compared to self-training approaches, the superior performance of ReFT indicates the effectiveness of its on-policy sampling and reinforcement learning over standard data augmentation methods.
Reward hacking
The research reveals that ReFT is susceptible to reward hacking in MathQAMCQ’s multiple-choice format, which distorts reinforcement learning training. The issue arises when a CoT with incorrect reasoning is rewarded simply because it lands on the correct multiple-choice option. To mitigate this, the researchers used a version of MathQA that requires direct numeric answers, eliminating the choice-based questions. With this setup, the results show ReFT’s consistent superiority over SFT with both Galactica and CodeLLAMA models.
Majority voting and reward reranking
Furthermore, ReFT benefits from majority voting and reward model reranking, outperforming SFT in these scenarios, as evidenced in Table 2. Compared with other open-source methods, ReFT's best variant shows remarkable performance, especially in the CodeLLAMA + ReFT + Reranking setup, which achieves notable accuracy and even rivals GPT-3.5-turbo despite being a smaller 7B model.
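For reference, majority voting at inference time is simple to sketch: sample several CoTs per question, extract each final answer, and return the most common one. Reward model reranking would instead score each sampled CoT with a trained reward model and keep the top-scored answer. The sampler and extractor hooks below are hypothetical.

```python
# Sketch of majority voting over sampled CoT answers at inference time.
from collections import Counter
from typing import Callable, List, Optional


def majority_vote(
    question: str,
    generate_cots: Callable[[str, int], List[str]],        # samples CoTs from the fine-tuned model
    extract_final_answer: Callable[[str], Optional[str]],  # pulls the final answer from a CoT
    num_samples: int = 16,
) -> Optional[str]:
    answers = [extract_final_answer(c) for c in generate_cots(question, num_samples)]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]           # the most frequent answer wins
```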
Final remarks
Reinforcement fine-tuning offers a powerful approach to adapt large language models for specialized tasks. By combining reward-based training loops with a carefully curated dataset and grading system, models learn not just the “right answers” but also the reasoning behind them. This leads to more robust performance and adaptability, even when training data is scarce. Recent outcomes—like smaller, ReFT-optimized models outperforming larger baselines—underscore just how effective this method can be, pointing the way toward more efficient and domain-focused language applications.