More on RAG: How cache-augmented generation reduces latency, complexity for small workloads




Retrieval-augmented generation (RAG) has become the de facto way to customize large language models (LLMs) for specialized information. However, RAG comes with upfront technical costs and can be slow. Now, thanks to advances in long-context LLMs, businesses can bypass RAG by inserting all of their proprietary information directly into the prompt.

A new study by researchers at National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can create customized applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient replacement for RAG in business settings where the knowledge corpus can fit into the model's context window.

RAG limitations

RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the request and adds them to the prompt as context so that the LLM can generate more accurate responses.

However, RAG introduces several limitations to LLM applications. The extra retrieval step adds latency, which degrades the user experience. The quality of the result also depends on the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken into smaller chunks, which can harm the retrieval process.

And in general, RAG adds complexity to the LLM application, requiring the development, integration and maintenance of additional components. The added overhead slows down the development process.

Cache-augmented generation

RAG (top) vs CAG (bottom) (source: arXiv)

An alternative to developing a RAG pipeline is to insert the entire document corpus into the prompt and let the model choose which parts are relevant to the request. This approach removes the complexity of the RAG pipeline and the problems caused by retrieval errors.

However, there are three key challenges to front-loading all documents into the prompt. First, long prompts slow down the model and increase inference costs. Second, the length of the LLM's context window sets a limit on the number of documents that will fit in the prompt. And finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its responses. So, simply stuffing all of your documents into the prompt instead of choosing the most relevant ones can end up hurting the model's performance.
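To make that trade-off concrete, here is a minimal sketch of the front-loading approach, assuming the tiktoken tokenizer library; the token budget, helper name and tokenizer choice are illustrative and not part of the study.

```python
# Illustrative sketch: front-load a document corpus into one prompt,
# stopping when the model's context budget would be exceeded.
import tiktoken  # assumes the tiktoken package is installed

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer assumed for illustration

def build_preloaded_prompt(question: str, documents: list[str],
                           context_budget: int = 120_000) -> str:
    """Concatenate documents ahead of the question, respecting a token budget."""
    parts, used = [], len(enc.encode(question))
    for doc in documents:
        doc_tokens = len(enc.encode(doc))
        if used + doc_tokens > context_budget:
            break  # corpus too large: a sign that RAG may still be the better fit
        parts.append(doc)
        used += doc_tokens
    return "\n\n".join(parts) + "\n\nQuestion: " + question
```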

The proposed CAG approach uses three key trends to overcome these challenges.

First, advanced caching techniques are making it faster and cheaper to process prompt templates. The premise of CAG is that the knowledge documents are included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so when requests arrive. This upfront computation reduces the time it takes to process user requests.
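As a rough illustration of this idea (not the authors' code), the sketch below uses Hugging Face transformers to run the knowledge documents through the model once, keep the resulting key-value cache, and reuse it for each incoming question. It assumes a recent transformers version that accepts a precomputed past_key_values in generate; the model name and file paths are placeholders.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model and document paths are illustrative.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

knowledge = "\n\n".join(open(p).read() for p in ["doc1.txt", "doc2.txt"])
knowledge_ids = tokenizer(knowledge, return_tensors="pt").input_ids

# Process the knowledge corpus once and keep its key-value (attention) cache.
with torch.no_grad():
    knowledge_cache = model(knowledge_ids, use_cache=True).past_key_values

def answer(question: str) -> str:
    # Only the short question is processed at request time; the attention
    # values of the knowledge tokens come from the precomputed cache.
    q_ids = tokenizer("\n\nQuestion: " + question + "\nAnswer:",
                      return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([knowledge_ids, q_ids], dim=-1)
    out = model.generate(input_ids,
                         past_key_values=copy.deepcopy(knowledge_cache),
                         max_new_tokens=200)
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
```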

Leading LLM providers such as OpenAI, Anthropic and Google offer prompt-caching features for the recurring parts of your prompt, which can include the knowledge documents and instructions you insert at the beginning of the prompt. With Anthropic, you can reduce costs by up to 90% and latency by up to 85% on the cached portions of your prompt. Equivalent caching features have been developed for open-source LLM-hosting platforms.
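For example, with Anthropic's prompt-caching feature the knowledge documents can be placed in a system block marked as cacheable, so repeated requests reuse it. This is a hedged sketch assuming a recent SDK version with prompt caching available; the model ID, file name and question are placeholders.

```python
# Sketch of provider-side prompt caching with the Anthropic SDK: the large,
# static knowledge block is cached, so repeated requests only pay full price
# for the short question at the end.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
knowledge = open("knowledge_corpus.txt").read()  # illustrative file name

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer using only the documents provided."},
        {   # mark the large, static block as cacheable
            "type": "text",
            "text": knowledge,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does the warranty cover?"}],
)
print(response.content[0].text)
```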

Second, long-context LLMs are making it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, while GPT-4o supports 128,000 tokens and Gemini up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.

And finally, advanced training methods are enabling models to do better retrieval, reasoning and question answering over very long sequences. In the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench and RULER. These benchmarks test LLMs on hard problems such as multi-fact retrieval and multi-hop question answering. There is still room for improvement in this area, but AI labs continue to make progress.

As new generations of models continue to expand their context windows, they will be able to process larger collections of knowledge. Furthermore, we can expect models to keep improving in their ability to extract and use relevant information from long contexts.

“These two trends will greatly extend the usability of our method, enabling it to handle more complex and diverse applications,” the researchers wrote. “Therefore, our approach is well-designed to be a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.”

RAG vs. CAG

To compare RAG and CAG, the researchers ran experiments on two widely recognized question-answering benchmarks: SQuAD, which focuses on context-aware Q&A over single documents, and HotPotQA, which requires multi-hop reasoning over multiple documents.

They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the classic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use to answer the question. Their experiments showed that CAG outperformed both RAG systems in most situations.
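For a sense of what the sparse-RAG baseline looks like in practice, here is a hedged sketch using the rank_bm25 package (not the authors' exact setup): passages are ranked with BM25 and only the top hits are passed to the model, whereas the CAG condition skips retrieval and preloads every passage, as in the caching sketch above. The sample passages are placeholders.

```python
# Illustrative sparse-RAG baseline: retrieve top passages with BM25,
# then hand only those passages to the LLM as context.
from rank_bm25 import BM25Okapi

passages = [
    "The warranty covers parts and labor for two years.",
    "Shipping is free for orders over $50.",
]  # in the study, passages come from SQuAD / HotPotQA documents
bm25 = BM25Okapi([p.lower().split() for p in passages])

def retrieve(question: str, k: int = 3) -> list[str]:
    # Return the k passages with the highest BM25 score for the question.
    return bm25.get_top_n(question.lower().split(), passages, n=k)

def rag_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```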

CAG outperforms both sparse RAG (BM25 retrieval) and dense RAG (OpenAI embeddings) (source: arXiv)

“By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning of all relevant information,” the researchers wrote. “This advantage is particularly evident in scenarios where RAG systems may capture incomplete or irrelevant passages, leading to suboptimal response generation.”

CAG also significantly reduces the time it takes to generate responses, especially as the length of the reference text increases.

Generation time for CAG is lower than for RAG (source: arXiv)

That said, CAG is not a silver bullet and should be used with caution. It is suited to settings where the knowledge base does not change often and is small enough to fit within the model's context window. Businesses should also be careful when their documents contain facts that conflict with each other depending on their context, which can confuse the model during inference.

The best way to determine whether CAG is a good fit for your use case is to run a few experiments. Fortunately, implementing CAG is very easy, and it should be considered as a first step before investing in more development-intensive RAG solutions.


