Google’s new neural-net LLM architecture separates memory components to control the exploding cost of capacity and computation




A new neural-network architecture developed by Google researchers could solve one of the biggest challenges for large language models (LLMs): extending their memory at inference time without exploding the costs of memory and compute. Called Titans, the architecture enables models to find and store, during inference, small bits of information that are important in long sequences.

Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle short- and long-term memory tasks effectively. According to the researchers, LLMs using neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba while having fewer parameters.

Attention layers and linear models

The classic transformer architecture used in LLMs employs the self-attention mechanism to compute relationships between tokens. This is an effective technique for learning complex and granular patterns in token sequences. However, as the sequence length grows, the computational and memory costs of computing and storing attention grow quadratically.
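
To see where that quadratic growth comes from, consider a stripped-down, single-head attention step with no learned projections (a rough illustration, not the exact computation used in any particular model): the score matrix compares every token with every other token, so its size, and the work to fill it, scales with the square of the sequence length.

    import torch

    def toy_self_attention(x):
        # x: (seq_len, d_model) token embeddings, with no learned projections.
        scores = x @ x.T                                    # (seq_len, seq_len): every token vs. every token
        weights = torch.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ x                                  # compute and memory grow with seq_len ** 2

    out = toy_self_attention(torch.randn(1024, 64))         # already a 1,024 x 1,024 score matrix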

Newer proposals include alternative architectures with linear complexity that can scale without exploding memory and compute costs. However, the Google researchers argue that linear models are not competitive with classic transformers, because they compress their contextual data and tend to miss important details.

The ideal architecture, they suggest, should have different memory components that can be coordinated to use existing knowledge, memorize new facts, and learn abstractions from their context.

“We argue that in an effective learning paradigm, similar to (the) human brain, there are distinct but interrelated modules, each of which is responsible for an important part of the learning process,” wrote the researchers.

Neural long-term memory

“Memory is a confederation of systems—for example, short-term, working, and long-term memory—each serving a distinct function with different neural structures, and each capable of operating independently,” the researchers wrote.

To fill the gap in existing language models, the researchers proposed a “neural long-term memory” module that can learn new information at inference time without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can memorize new facts during inference and dynamically adapt the memorization process based on the data it encounters. This solves the generalization problem that other neural-network architectures suffer from.

To decide which pieces of information are worth keeping, the neural memory module uses the concept of “surprise.” The more a sequence of tokens differs from the kind of information stored in the model’s weights and the existing memory, the more surprising it is and therefore the more worth memorizing. This allows the module to use its limited memory efficiently, storing only those pieces of data that add useful information to what the model already knows.

To handle very long data sequences, the neural memory module has an adaptive forgetting mechanism that allows it to delete information that is no longer needed, which helps to manage limited memory capacity.
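
The sketch below is one loose way to picture a surprise-driven update with forgetting in PyTorch. It is a simplification built on assumptions, not the paper’s formulation or released code: the memory is reduced to a single matrix, the surprise signal is the gradient of a reconstruction error, and the learning-rate, momentum, and forgetting values are arbitrary.

    import torch

    def memory_update(M, k, v, surprise, lr=0.1, momentum=0.9, forget=0.01):
        # M: memory matrix mapping keys to values; k, v: key/value vectors for the current tokens.
        M = M.detach().requires_grad_(True)
        loss = ((M @ k - v) ** 2).sum()        # how badly the current memory predicts v from k
        loss.backward()                        # the gradient acts as the "surprise" signal
        surprise = momentum * surprise - lr * M.grad   # momentum: surprising tokens keep shaping later updates
        M = (1 - forget) * M.detach() + surprise       # forgetting gate decays stale content to free capacity
        return M, surprise

    d = 64
    M, surprise = torch.zeros(d, d), torch.zeros(d, d)
    for _ in range(5):                         # stream of incoming (key, value) pairs at inference time
        M, surprise = memory_update(M, torch.randn(d), torch.randn(d), surprise)

In this toy version, a key/value pair the memory already reconstructs well produces a small gradient and barely changes M, while a surprising pair produces a large update.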

The memory module can complement the attention mechanism of current transformer models, which the researchers describe as “short-term memory modules, attending to the current context window size. On the other hand, our neural memory with the ability to continuously learn from data and store it in its weights can play the role of a long-term memory.”

The Titans architecture

Example of Titan architecture (source: arXiv)

The researchers describe Titans as a family of models that combine existing transformer blocks with neural memory modules. The architecture has three key components: the “core” module, which acts as short-term memory and uses the classic attention mechanism to attend to the current segment of input tokens the model is processing; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, whose learnable parameters remain fixed after training and store knowledge that is independent of time.
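
As a rough sketch of how those three roles could sit in one block, the code below wires them together with standard PyTorch layers as stand-ins. The class and parameter names are illustrative, not taken from the paper or from any released code, and the long-term memory is represented by a plain linear layer rather than a module that learns at inference time.

    import torch
    import torch.nn as nn

    class TitansBlockSketch(nn.Module):
        def __init__(self, d_model=256, n_heads=4, n_persistent=16):
            super().__init__()
            # "Core": short-term memory via classic attention over the current window.
            self.core = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # "Long-term memory": stand-in for the neural memory module.
            self.long_term = nn.Linear(d_model, d_model)
            # "Persistent memory": learnable tokens, frozen after training.
            self.persistent = nn.Parameter(torch.randn(1, n_persistent, d_model))

        def forward(self, x):                   # x: (batch, seq_len, d_model)
            recalled = self.long_term(x)        # knowledge retrieved from beyond the current context
            ctx = torch.cat([self.persistent.expand(x.shape[0], -1, -1), recalled, x], dim=1)
            out, _ = self.core(x, ctx, ctx)     # attend over persistent + recalled + current tokens
            return out

    y = TitansBlockSketch()(torch.randn(2, 128, 256))   # output: (2, 128, 256)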

The researchers propose different ways of connecting the three components, but in general the main advantage of the architecture is that it lets the attention and memory modules complement each other. For example, the attention layers can use historical and current context to determine which parts of the current context window should be stored in long-term memory. Meanwhile, the long-term memory provides historical knowledge that is not present in the current attention context.

The researchers ran small-scale tests of Titans models, ranging from 170 million to 760 million parameters, on a variety of tasks, including language modeling and long-sequence language tasks. They compared Titans’ performance against various transformer-based models, linear models such as Mamba, and hybrid models such as Samba.

Titans (red line) outperforms other models, including GPT-4, on long-sequence tasks in both few-shot and fine-tuned settings (source: arXiv)

Titans shows strong performance in language modeling compared with other models, outperforming both transformers and linear models of the same size.

The difference in performance is most pronounced on long-sequence tasks, such as “needle in a haystack,” where the model must retrieve a piece of information from a very long sequence, and BABILong, where the model must reason over facts distributed across very long documents. In fact, on these tasks, Titans outperforms models with orders of magnitude more parameters, including GPT-4 and GPT-4o mini, as well as a Llama-3 model enhanced with retrieval-augmented generation (RAG).

In addition, the researchers were able to expand the Titans context window up to 2 million tokens while keeping memory costs at a moderate level.

The models still need to be tested in larger sizes, but the results from the paper show that the researchers still haven’t hit the ceiling of the Titans’ potential.

What does this mean for business applications?

With Google at the forefront of long-context models, we can expect this technique to find its way into both private and open models such as Gemini and Gemma.

With LLMs supporting longer context windows, there is growing potential for creating applications where you insert new knowledge into the prompt instead of using techniques like RAG. The development cycle for building and iterating on prompt-based applications is much faster than for complex RAG pipelines. Meanwhile, architectures like Titans can help reduce inference costs for very long sequences, making it possible for companies to deploy LLM applications for more use cases.

Google plans to release PyTorch and JAX code for training and evaluating Titans models.


