Reinforcement learning from human feedback (RLHF) is taking the world of large language models by storm, unlocking the full potential of AI with the secret ingredient: human insight. However, this breakthrough comes with its own challenges, notably the high cost of gathering human feedback. Enter the work by a team from Google and Stanford, who experimented with a creative approach in their study, 'Efficient Exploration for LLMs.' By introducing double Thompson sampling and an epistemic neural network, they tackled the efficiency problem head-on and achieved remarkable outcomes. In this blog post, we'll look at why human feedback matters and how active exploration helps LLMs get more out of it.
Why is human feedback important?
Large language models (LLMs) have made a significant impact by learning from vast amounts of text. They start off smart, but there's a special way to make them even smarter: by using feedback from people. When people interact with these models, especially through chatbots, they can tell the model what's good and what's not. This feedback helps the models learn even faster and better understand what we want.
Now, imagine a model getting smarter by learning from all these interactions. There's a chance it could start coming up with ideas or solutions that no human has thought of before. But how do we make sure it keeps getting these great ideas? That's where the idea of "active exploration" comes in. Instead of just waiting for feedback to come in any old way, this method tries to get the most helpful feedback quickly. This could mean we don't have to wait too long to see these models do amazing things; it could happen much sooner. This is the innovation researchers from Google and Stanford brought with their new paper.
Methods
The paper introduces "active exploration" and shows that the model can achieve impressive results with far less feedback. It describes a common way of learning from feedback: sending queries to human raters, each containing a prompt and a pair of responses. Prompts come from a large corpus, and responses are generated by LLMs. A reward model is fit to the resulting data, and subsequent responses are aligned with the feedback received. The authors call the standard practice of sampling each pair of responses directly from the language model passive exploration. This baseline is compared against the following active exploration algorithms:
- Boltzmann exploration: selects responses with probability weighted toward higher predicted rewards, using only point estimates from the reward model.
- Exploration with an epistemic neural network (ENN), which also provides uncertainty estimates and comes in two variants:
  - Infomax: selects the pair of responses whose feedback is expected to reveal the most information.
  - Double Thompson sampling (double TS): samples responses according to the probability that they are optimal (see the sketch below).
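To make double TS concrete, here is a minimal sketch of how a pair could be selected from N candidates. This is not the authors' code: `reward_fn(response, z)` is a hypothetical interface that returns a reward estimate under epistemic index `z`, and the constants are arbitrary.

```python
import numpy as np

def double_thompson_sample(candidates, reward_fn, num_indices=30, max_tries=10, rng=None):
    """Select a pair of responses for one prompt via double Thompson sampling.

    candidates: list of N candidate responses.
    reward_fn(response, z): hypothetical reward estimate under epistemic index z.
    """
    if rng is None:
        rng = np.random.default_rng()

    def sample_best():
        # Draw one epistemic index, i.e. one plausible reward function,
        # and pick the candidate it rates highest.
        z = rng.integers(num_indices)
        return int(np.argmax([reward_fn(r, z) for r in candidates]))

    first = sample_best()
    for _ in range(max_tries):
        second = sample_best()
        if second != first:  # keep resampling until the pair is distinct
            return candidates[first], candidates[second]
    # Fallback: pair the first pick with a random other candidate.
    others = [i for i in range(len(candidates)) if i != first]
    return candidates[first], candidates[int(rng.choice(others))]
```

Sampling a fresh index for each pick is what makes both responses plausible winners, so the resulting query is likely to be informative.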
Comparing double TS vs. passive, Boltzmann, and infomax methods
In the plot below, you can see the queries required by double TS vs. alternatives to attain different levels of performance. Let’s break it down.
The X-axis shows how many queries double TS needed to reach a given level of performance, and the Y-axis shows how many queries the alternative method needed to reach that same level. Each plotted point corresponds to one performance level.
- The blue curve for passive exploration clearly shows that double TS needs far fewer queries to reach the same performance.
- Boltzmann exploration performed best among the algorithms that rely only on a point-estimate reward model (no uncertainty estimates). Even so, the red Boltzmann curve shows that double TS still delivers a dramatic improvement over the best point-estimate approach.
- The green infomax curve shows that even among algorithms that use uncertainty estimates, the choice of exploration scheme can make a big difference: double TS still needs fewer queries than infomax, although the gap is smaller than against passive and Boltzmann exploration.
The main takeaway from the plot is that passive exploration requires far more queries to achieve good results than the active exploration methods.
Experimentation
To test the findings, the researchers used Anthropic datasets and Gemini Nano and Gemini Pro as pre-trained models.
The human feedback simulator generates a binary preference between the two responses in each query. The experimental setup consists of two pipelines:
- Learning pipeline: controls the interface between the agent and the human feedback simulator during sequential querying and learning.
- Assessment pipeline: controls the interface between the pre-trained model, the new response generation model, and the human feedback simulator when assessing relative performance.
How does the agent learn?
The figure below shows the agent’s learning process.
Let’s break it down.
The agent crafts each query and presents it to the human preference simulator, which selects one of the two responses. In each epoch, the agent sends B queries and receives B bits of feedback (one binary preference per query). The prompts are sampled from the Anthropic Helpfulness Base train dataset.
And how are the responses generated? For each prompt, the agent generates N candidate responses with Gemini Nano and then uses an exploration algorithm to select two of them. To make this selection, the exploration scheme consults a reward model trained on the queries and feedback gathered so far, as sketched below.
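Here is a rough sketch of one learning epoch under these assumptions. The helpers passed in (`generate_candidates`, `select_pair`, `simulate_preference`, `update_reward_model`) are hypothetical stand-ins for the components described above, not the authors' implementation.

```python
def run_epoch(prompts, reward_model, generate_candidates, select_pair,
              simulate_preference, update_reward_model, num_candidates=8):
    """One epoch of sequential querying and learning (simplified sketch)."""
    feedback = []
    for prompt in prompts:  # B prompts -> B queries -> B bits of feedback
        # Generate N candidate responses with the pre-trained model (e.g. Gemini Nano).
        candidates = generate_candidates(prompt, n=num_candidates)
        # The exploration algorithm (e.g. double TS) consults the current
        # reward model to choose the two responses that form the query.
        response_a, response_b = select_pair(candidates, reward_model)
        # The human feedback simulator returns a binary preference (one bit).
        preferred = simulate_preference(prompt, response_a, response_b)
        feedback.append((prompt, response_a, response_b, preferred))
    # Refit the reward model on all feedback gathered so far.
    return update_reward_model(reward_model, feedback)
```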
For some of the agents, the reward model is an epistemic neural network where the exploration algorithm has access to uncertainty estimates besides the point estimates of the reward.
Each reward model builds on the “torso” of the Gemini Nano model. This means that the reward model first computes the last-layer embedding of the pre-trained transformer model, after which it applies a multilayer perceptron (MLP) head.
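As a rough illustration of this architecture, here is a minimal PyTorch sketch. The layer sizes are assumptions, and the torso embedding is simply passed in as a tensor, standing in for the Gemini Nano torso, which is not publicly available.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Point-estimate reward model: pre-trained torso embedding + small MLP head."""

    def __init__(self, torso_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(torso_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, torso_embedding: torch.Tensor) -> torch.Tensor:
        # torso_embedding: last-layer embedding of a prompt-response pair.
        return self.mlp(torso_embedding).squeeze(-1)  # one scalar reward per pair
```

Reusing the pre-trained torso means the head only has to map an existing embedding to a scalar score, which keeps the reward model cheap to fit to new feedback.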
Assessing the agent’s performance
To simulate how humans choose between responses, a reward model scores each prompt-response pair. The preference for each query is then sampled according to the Bradley-Terry choice model, based on the scores assigned to the two prompt-response pairs. This simulator reward model is fit to Anthropic datasets, and its architecture again uses a torso, this time of Gemini Pro.
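Concretely, the Bradley-Terry model turns the score gap between the two responses into a preference probability. Here is a minimal sketch (the exact scaling used in the paper may differ):

```python
import numpy as np

def simulate_preference(score_a: float, score_b: float, rng=None) -> int:
    """Sample a binary preference from the Bradley-Terry choice model.

    score_a, score_b: scores the simulator's reward model assigns to the
    two prompt-response pairs. Returns 0 if A is preferred, 1 otherwise.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Probability that A is preferred grows with its score advantage.
    p_prefer_a = 1.0 / (1.0 + np.exp(score_b - score_a))  # logistic in the score gap
    return 0 if rng.random() < p_prefer_a else 1
```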
By the way, since Gemini Pro is far larger than Gemini Nano, choices are made using a much more complex model than that available to the agent. This difference in scale is intended to reflect the fact that humans may exhibit more complex behavior than that modeled by the agent.
The figure below outlines how the researchers check the agent's performance against the Gemini Nano model.
The researchers used a set of test prompts and, for each, compared two answers: one sampled from Gemini Nano and one from the new agent, which chooses its best response based on the learned reward model. The simulator predicts which answer people would prefer, and averaging these predictions gives the agent's win rate: how often its answers are preferred over Gemini Nano's. This provides a clear measure of the agent's success in producing preferred responses.
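In code, the win-rate estimate could look like the sketch below. All three callables are hypothetical stand-ins: one response policy for the agent, one for the Gemini Nano baseline, and the simulator's preference probability.

```python
def estimate_win_rate(test_prompts, agent_respond, baseline_respond, prefer_prob):
    """Average probability that the agent's response beats the baseline's.

    agent_respond / baseline_respond: callables returning one response per prompt.
    prefer_prob(prompt, a, b): simulator's probability that `a` is preferred over `b`.
    """
    wins = [
        prefer_prob(prompt, agent_respond(prompt), baseline_respond(prompt))
        for prompt in test_prompts
    ]
    return sum(wins) / len(wins)
```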
The experiment doesn’t use the usual complex methods for optimizing against the reward. Instead, the agent checks out several responses from the Gemini Nano model and picks the top scorer. This best-of-N strategy mimics the fancier policy-optimization tricks without all the heavy lifting or the hyperparameter tuning usually needed. It's a cleaner, more straightforward way of doing things. By tweaking how many responses the agent considers, it finds a sweet spot between sticking close to the base model and chasing the best rewards, making the whole process smoother and more focused on getting good results.
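The best-of-N step itself is simple; a minimal sketch, with `generate_candidates` and `reward_model` as hypothetical stand-ins for the base model and the learned scorer:

```python
def best_of_n_response(prompt, generate_candidates, reward_model, n=16):
    """Best-of-N policy: sample N candidates and return the highest-scoring one."""
    candidates = generate_candidates(prompt, n=n)            # e.g. from Gemini Nano
    scores = [reward_model(prompt, c) for c in candidates]   # learned reward estimates
    return candidates[scores.index(max(scores))]
```

Raising `n` pushes the policy harder toward high-reward responses; keeping it small keeps the outputs close to what the base model would produce anyway.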
Reward model architecture
In the setup, reward models help pick the best responses during both learning and assessment. Two kinds of reward models are used, both fit to preference data. One assigns a single point-estimate reward to each prompt-response pair. The other additionally takes an epistemic index as input, which lets it express uncertainty about the rewards it assigns.
Picture this: a question-answer pair goes through the language model's core processing unit, and then the reward model steps in to assign its value, as shown in the figure below.
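To make the epistemic-index idea concrete, here is a minimal sketch. The paper uses an epinet architecture; the ensemble of MLP heads below is a simpler stand-in, where the index just selects a head and the spread across heads reflects uncertainty.

```python
import torch
import torch.nn as nn

class EpistemicRewardHead(nn.Module):
    """Reward head conditioned on an epistemic index (simplified ENN sketch)."""

    def __init__(self, torso_dim: int, hidden_dim: int = 256, num_indices: int = 10):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(torso_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
            for _ in range(num_indices)
        ])

    def forward(self, torso_embedding: torch.Tensor, index: int) -> torch.Tensor:
        # Different epistemic indices give different plausible reward estimates;
        # the disagreement between indices is the model's uncertainty signal.
        return self.heads[index](torso_embedding).squeeze(-1)
```

Exploration algorithms like double TS sample an index, score candidates under it, and thereby take that uncertainty into account when choosing which pair of responses to query.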
Results of exploration algorithms
The figure below shows how different agents improve their win rates over time with more feedback, based on results from five different starting points. It highlights that agents that explore more actively learn faster and end up winning more often. Specifically, the agent using double TS stands out as the best.
Early on, infomax looked promising but didn't keep up with double TS in the long run. This might be because infomax loves to gather information, even when it won't necessarily lead to better outcomes.
As we look at the trends in the plot, it seems like all the agents eventually hit a plateau, where doing more of the same doesn't really make them any better. This plateau is tied to the reward model's capacity, which you can think of as how much it can learn from the feedback it gets. Once it learns all it can, given its capacity, more data doesn't help much. But, if you boost the model's capacity, it can keep improving with more data, though this requires more computing power. This idea ties back to a suggestion by Arumugam & Van Roy in 2021, noting that adjusting the complexity of what an agent is trying to learn based on how long it plans to keep learning can be a smart move.
Wrapping up
Diving into how we teach big AI models to learn from us, a recent study has found some pretty clever ways to do it better and faster. By combining human feedback with new tricks like double Thompson sampling, researchers from Google and Stanford are onto something big. It's all about making AI smarter by learning directly from our input without wasting time or resources. This breakthrough could mean AI that understands us better and comes up with ideas we've never even thought of, all thanks to a smarter way of learning from human insights.