Language models have come a long way, from the early ChatGPT with a context window of just a few thousand tokens to Gemini 1.5 Pro, which handles up to a million tokens at once. To process more information than those early context windows could hold, researchers developed retrieval-augmented generation (RAG), a technique that pairs a language model with a retriever so it can pull accurate, up-to-date answers from external documents. But now that context windows are roughly a thousand times bigger and can fit a whole encyclopedia, a question naturally arises: do we still need RAG?
The short answer is yes. The long answer is that it's not only about having more information, but about having the right information to make smarter decisions. In this blog post, we discuss the pros and cons of long context windows vs. RAG and dig into why the two are not mutually exclusive. We'll explain why LLMs given extremely long contexts can struggle to focus on the relevant information, which hurts response quality. To address this problem, we'll also discuss a new approach – OP-RAG – which improves RAG response quality on long-context tasks.
Long-context LLM pros
Long-context LLMs make it easy to feed in information, which tempts people to think they have replaced RAG (they haven't). Here are a few advantages of LLMs with long context windows.
Quick retrieval
Long-context models like Claude 2 can continuously take in new input, reason over it, and retrieve information on the fly. RAG, on the other hand, requires indexing external documents up front and then reuses that same data for all of its tasks. In other words, while you can keep dropping new information straight into the context, doing the same with RAG takes a few extra steps.
Easier to use
RAG involves multiple components: a retriever, an embedding model, and the language model itself. To make your RAG setup work well, you need to choose embedding parameters and a chunking strategy, and then test whether the system actually returns correct answers. Overall, that's a bit more effort than simply pasting a long prompt.
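To make that concrete, here's a minimal sketch of the moving parts a basic RAG setup involves, assuming the sentence-transformers library for embeddings. The file name, model name, chunk size, and top-k value are illustrative choices, not recommendations.

```python
# A minimal, illustrative RAG pipeline: chunking, embedding, retrieval, prompting.
# Assumes the sentence-transformers package; chunk size, model name, and top-k
# are arbitrary example values, not tuned recommendations.
import numpy as np
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, size: int = 500) -> list[str]:
    """Split a document into fixed-size character chunks (one of many strategies)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

document = open("my_document.txt").read()           # your external knowledge source
chunks = chunk_text(document)
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

query = "What does the report say about Q3 revenue?"
query_vec = embedder.encode(query, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is just a dot product.
scores = chunk_vecs @ query_vec
top_k = np.argsort(scores)[::-1][:4]
context = "\n\n".join(chunks[i] for i in top_k)

prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` is then sent to whichever LLM you use for generation.
```

Each of these choices (chunk size, embedding model, number of retrieved chunks) typically needs its own round of evaluation, which is exactly the extra effort the long-context approach avoids.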
Handy for simple tasks
If your use case is not too complex and you only need relatively simple retrieval from large volumes of text, long-context models can be fast and handy. If you try it with an LLM and it works just fine, you may not need to bother setting up a RAG system at all. Later in the article, however, we'll discuss cases where this approach falls short and RAG comes to the rescue.
RAG pros
While long-context LLMs offer an expansive view, pulling in millions of tokens at once, RAG continues to hold its place in handling data efficiently. Here's why RAG is sticking around.
Complex RAG is here to stay
The simpler forms of RAG, which chunk and retrieve data in trivial ways, might be seeing a decline. But more complex RAG setups are far from fading away.
Today's RAG systems incorporate techniques like query rewriting, chunk reordering, data cleaning, and optimized vector search, which enhance their capabilities and extend their reach.
RAG can be more efficient
Expanding an LLM's context window to include huge chunks of text certainly comes with its own set of hurdles, especially when you consider the slower response times and the uptick in computing costs. The bigger the context, the more data there is to process, which can really start to add up. On the other hand, RAG keeps things lean and mean by retrieving only the relevant and necessary bits of information.
RAG is more resource-friendly
RAG remains the more affordable and faster solution when compared to the extensive processing involved with long-context windows. It allows developers to enhance LLMs with additional context without the hefty time and costs of dealing with enormous data blocks.
RAG is easier to debug and evaluate
RAG is an open book: you can easily follow the thread from question to retrieved passages to answer. This is especially useful for big documents or complex reasoning tasks. It makes answers easy to debug, whereas stuffing too much into the context is hard to trace and can lead to errors and hallucinations.
RAG is up-to-date
One of RAG's biggest advantages is that it integrates the most current data into the LLM's decision-making process. By connecting directly to updated databases or making external calls, RAG ensures that the information being used is the latest available, which is vital for applications where timeliness is critical.
RAG handles information strategically
In general, LLMs perform best when key information sits at the beginning or the end of the input. According to recent research, if your question depends on information buried in the middle of the context, you might be disappointed with the answer. With RAG, on the other hand, you can use techniques like document reordering to strategically position documents according to their priority. Doing the same manually inside one giant prompt would be a big hurdle.
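As an illustration, here's a hedged sketch of one common reordering pattern (not taken from any particular library): given documents already sorted by relevance, it alternates them so the highest-priority ones end up at the start and end of the prompt, where models tend to pay the most attention.

```python
# Sketch of a "lost in the middle" mitigation: place the most relevant documents
# at the beginning and end of the prompt, and the least relevant in the middle.
def reorder_by_priority(docs_by_relevance: list[str]) -> list[str]:
    """docs_by_relevance: documents sorted from most to least relevant."""
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th, 6th... to the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]  # the most relevant documents sit at both ends


docs = ["doc A (most relevant)", "doc B", "doc C", "doc D", "doc E (least relevant)"]
print(reorder_by_priority(docs))
# ['doc A (most relevant)', 'doc C', 'doc E (least relevant)', 'doc D', 'doc B']
```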
Why RAG will stay despite the long-context LLMs trend
A recent study by Nvidia sheds light on an important finding: LLMs given extremely long contexts can struggle to focus on the relevant information, which degrades answer quality. To address this, the authors propose a new approach that revisits RAG for long-context generation—the OP-RAG mechanism.
OP-RAG: Revisiting RAG in long-context generation
OP-RAG is a technique that improves the quality of RAG in long-context question-answering applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises and then declines, forming an inverted U-shaped curve.
At its sweet spot, OP-RAG achieves higher-quality answers with far fewer tokens than long-context LLMs use. Experiments on public benchmarks, which we'll examine shortly, demonstrated the superiority of OP-RAG.
Chunk order matters for RAG
In this study, the researchers took a fresh look at how retrieval-augmented generation (RAG) works with LLMs that handle large amounts of text. They found something interesting: the order in which retrieved chunks are fed to the model really matters. Instead of the usual approach of sorting chunks by relevance, this research kept the chunks in their original document order. It turns out this tweak—called order-preserve RAG—noticeably improved the quality of the answers RAG came up with.
The more chunks order-preserve RAG pulled in, the better the answers got at first. This boost happens because the model gets to see a wider range of relevant info, helping it find the right context to craft more accurate responses. But there's a catch: pulling in too many chunks starts to backfire after a certain point. It brings in a lot of irrelevant or distracting information that can confuse the model, causing the quality of the answers to dip. The key, then, is finding that sweet spot where just enough context is used to improve recall without overloading the model with unnecessary noise.
This approach goes against the grain of some recent studies, which suggest that a bigger context window is always better. For instance, the study showed that with just 16,000 well-chosen tokens and order-preserve RAG, the Llama3.1-70B model could reach an impressive 44.43 F1 score. That's far better than feeding it all 128,000 tokens without RAG, which scored only 34.32. Even models like GPT-4o and Gemini 1.5 Pro, given the full context, didn't beat the scores from this approach.
How OP-RAG works
Let's dive into how OP-RAG works. Imagine you have a long document, referred to as 'd'. It is split sequentially into uniform pieces, or 'chunks', which we label c1, c2, and so on up to cn, where 'n' is the total number of chunks.
When someone submits a query 'q', we need to identify which chunks are most relevant to it. This is done using cosine similarity, which measures how closely the content of each chunk relates to the query. The result is a relevance score, si, for each chunk ci.
The next step is where OP-RAG stands out. The top chunks are picked based on their relevance scores, but instead of rearranging them by these scores, we keep them in the order they appear in the document. This means if chunk c5 is originally before c7 in the document, it will stay that way in our lineup, regardless of their individual relevance scores.
This method differs from traditional RAG, where chunks are ordered solely by relevance, possibly disrupting the natural flow of information. By keeping the original sequence, OP-RAG helps maintain the logical progression of the text, which is essential for producing coherent and accurate responses.
The figure below visually represents this concept, comparing how traditional RAG and OP-RAG organize text chunks. By keeping to the original document order, OP-RAG avoids potential confusion and ensures that the responses not only address the query accurately but also maintain contextual integrity.
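To make the contrast concrete, here's a minimal sketch of the two selection strategies. The function names and example values are illustrative, and it assumes the relevance scores si have already been computed (for example via cosine similarity, as described above).

```python
# Given chunks c1..cn (in document order) and their relevance scores s1..sn,
# compare vanilla RAG ordering with OP-RAG's order-preserving selection.
def vanilla_rag_select(chunks: list[str], scores: list[float], k: int) -> list[str]:
    """Top-k chunks, ordered by descending relevance score (traditional RAG)."""
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]


def op_rag_select(chunks: list[str], scores: list[float], k: int) -> list[str]:
    """Top-k chunks, kept in their original document order (OP-RAG)."""
    top_k = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top_k)]  # restore document order


chunks = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
scores = [0.2, 0.7, 0.1, 0.5, 0.9, 0.3, 0.8]

print(vanilla_rag_select(chunks, scores, k=4))  # ['c5', 'c7', 'c2', 'c4']
print(op_rag_select(chunks, scores, k=4))       # ['c2', 'c4', 'c5', 'c7']
```

Even though c5 scores higher than c2, OP-RAG presents c2 first because it appears earlier in the document, mirroring the c5/c7 example above.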
OP-RAG results
OP-RAG was tested against two baseline approaches. The first uses long-context LLMs without any RAG. As the table below shows, this approach consumes a huge number of tokens, making it both inefficient and expensive. For example, without RAG the Llama3.1-70B model scored 34.26 F1 on the EN.QA dataset while using around 117,000 tokens. In contrast, OP-RAG achieved a much higher 47.25 F1 score using only 48,000 tokens.
The second baseline involves the SELF-ROUTE mechanism, which automatically decides whether to use RAG or a long-context LLM based on the model's self-assessment. OP-RAG was shown to outperform this approach too, and it does so using significantly fewer tokens.
Closing remarks
Large language models have come a long way and can now handle huge amounts of text. The common assumption was that bigger contexts translate into better performance and that we might not even need RAG anymore, but that isn't the case. In fact, LLMs can fall short when given too much text, and they often hallucinate.
OP-RAG, recently introduced by Nvidia, shows that there are still new ways to use RAG for dealing with lengthy texts. It's further evidence that complex RAG will stick around, at least for the near future.