DeepMind’s new inference-time scaling technique improves planning accuracy in LLMs


Inference-time scaling is one of the big themes of artificial intelligence in 2025, and AI labs are attacking it from different angles. In its latest research paper, Google DeepMind introduced “Mind Evolution,” a technique that optimizes the responses of large language models (LLMs) for planning and reasoning tasks.

Inference-time scaling techniques attempt to improve the performance of LLMs by allowing them to “think” more while generating their answers. In practice, this means that instead of generating its answer in one go, a model is allowed to generate multiple answers, review and correct them, and explore different ways to solve the problem.
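
To make the idea concrete, here is a minimal Best-of-N sketch in Python, one of the simplest inference-time scaling strategies: sample several independent answers and keep the highest-scoring one. The `call_llm` and `score_answer` functions are hypothetical stand-ins for a model API call and a task-specific evaluator, not code from DeepMind.

```python
import random

# Hypothetical stand-ins for a model call and an answer scorer; a real system
# would query an LLM API and use a task-specific evaluator.
def call_llm(prompt: str) -> str:
    return f"candidate answer {random.randint(0, 9999)} for: {prompt}"

def score_answer(answer: str) -> float:
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n independent answers and keep the highest-scoring one."""
    candidates = [call_llm(prompt) for _ in range(n)]
    return max(candidates, key=score_answer)

print(best_of_n("Plan a 3-day trip to Rome under $1,000."))
```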

Evolving LLM answers

Mind Evolution relies on two key components: search and genetic algorithms. Search algorithms are a common feature of many inference-time scaling techniques; they allow LLMs to explore different reasoning paths in search of the best solution. Genetic algorithms are inspired by natural selection. They generate and evolve a population of candidate solutions to optimize an objective, often called a “fitness function.”
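
For readers unfamiliar with genetic algorithms, the toy example below shows the classic generate/select/crossover/mutate cycle on a simple numeric objective; it illustrates the general mechanism only and contains nothing specific to Mind Evolution.

```python
import random

def fitness(x: float) -> float:
    """Toy objective to maximize; peaks at x = 3."""
    return -(x - 3.0) ** 2

def evolve(pop_size: int = 20, generations: int = 50) -> float:
    population = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Crossover (average two parents) plus mutation (small random nudge).
        children = [
            (random.choice(parents) + random.choice(parents)) / 2
            + random.gauss(0, 0.3)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

print(evolve())  # prints a value near 3.0
```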

The Mind Evolution algorithm (source: arXiv)

Mind Evolution begins by generating a population of candidate solutions expressed in natural language. The solutions are produced by an LLM that is given a problem description along with useful information and instructions. The LLM then evaluates each candidate and improves it if it does not meet the criteria for a solution.

The algorithm then selects the parents for the next generation of solutions by sampling from the existing population, with higher-quality solutions having a greater chance of being selected. It then generates new solutions through crossover (selecting parent pairs and combining their elements to create a new solution) and mutation (making random changes to the newly generated solutions). It again uses the evaluation method to refine the new solutions.

The cycle of evaluation, selection and recombination continues until the algorithm reaches the optimal solution or exhausts a preset number of iterations.
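
Assembled together, the loop described above might be sketched as follows. The `llm_propose`, `llm_crossover`, `llm_mutate`, and `evaluate` functions are hypothetical placeholders for the model calls and fitness evaluator described in the paper; this is an illustration of the control flow, not DeepMind's implementation.

```python
import random

# Hypothetical stand-ins for the LLM calls and the evaluator described above;
# a real system would prompt a model and score plans programmatically.
def llm_propose(task: str) -> str:
    return f"candidate plan {random.randint(0, 9999)}"

def llm_crossover(a: str, b: str) -> str:
    return f"merge({a} | {b})"

def llm_mutate(plan: str) -> str:
    return plan + " (revised)"

def evaluate(plan: str) -> float:
    return random.random()  # fitness in [0, 1]; a real evaluator checks constraints

def mind_evolution_sketch(task: str, pop_size: int = 16,
                          generations: int = 10, target: float = 0.99) -> str:
    # 1. Generate an initial population of natural-language solutions.
    population = [llm_propose(task) for _ in range(pop_size)]
    best_plan = population[0]
    for _ in range(generations):
        scores = [evaluate(p) for p in population]
        best_score, best_plan = max(zip(scores, population))
        if best_score >= target:  # stop once a plan passes evaluation
            return best_plan
        # 2-3. Sample parent pairs (better plans are likelier picks),
        # combine them via crossover, then apply a mutation.
        population = [
            llm_mutate(llm_crossover(*random.choices(population, weights=scores, k=2)))
            for _ in range(pop_size)
        ]
    return best_plan

print(mind_evolution_sketch("Plan a 7-day trip to Italy."))
```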

Refinement process for proposed solutions in the Mind Evolution algorithm (source: arXiv)

One important feature of Mind Evolution is its evaluation function. The evaluators used in inference-time scaling techniques often require that the problem be formalized from natural language into a structured, symbolic representation that can be processed by a solver program. Formalizing a problem demands significant domain expertise and a deep understanding of the problem to identify all the key elements that must be represented symbolically and how they relate to each other, which limits its applicability.

In Mind Evolution, the fitness function is designed to work with natural language planning tasks whose solutions are expressed in natural language. This allows the system to avoid formalizing problems, as long as a programmatic solution evaluator is available. The evaluator also provides textual feedback in addition to a numerical score, allowing the LLM to understand specific issues and make targeted improvements.
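
A natural-language fitness function of this kind might look like the sketch below: it checks a parsed plan against constraints in ordinary code and returns both a numeric score and plain-text feedback the LLM can act on. The field names and constraints here are invented for the example.

```python
def trip_fitness(plan: dict, budget: float, required_days: int) -> tuple[float, str]:
    """Score a parsed trip plan and explain any violations in plain text.

    Returns (score, feedback); the feedback string is fed back to the LLM
    so it can make targeted fixes. Field names are illustrative only.
    """
    issues = []
    if plan["total_cost"] > budget:
        issues.append(
            f"Plan costs ${plan['total_cost']:.0f}, exceeding the ${budget:.0f} budget."
        )
    if plan["num_days"] != required_days:
        issues.append(
            f"Itinerary covers {plan['num_days']} days but {required_days} were requested."
        )
    score = 1.0 - len(issues) / 2  # 1.0 means every constraint is satisfied
    feedback = " ".join(issues) if issues else "All constraints satisfied."
    return score, feedback

score, feedback = trip_fitness(
    {"total_cost": 1200.0, "num_days": 3}, budget=1000.0, required_days=3
)
print(score, feedback)
```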

“We focus on evolving solutions in natural language spaces instead of formal spaces. This removes the need to formalize the task, which requires significant effort and expert knowledge for each instance of the task,” the researchers write.

Mind Evolution also uses an “island” method to ensure that it explores a diverse set of solutions. At each stage, the algorithm creates separate groups of solutions that evolve independently. It then “migrates” the best solutions from one group to another so they can be combined with others to create new solutions.
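
The island model might be sketched as separate populations that evolve independently, with top solutions periodically copied to a neighboring island; `evolve_one_generation` is a hypothetical placeholder for the evolutionary step sketched earlier.

```python
def evolve_one_generation(population: list[str]) -> list[str]:
    """Hypothetical placeholder for one evaluate/select/crossover/mutate step,
    which would keep the population sorted best-first."""
    return population

def migrate(islands: list[list[str]], k: int = 2) -> None:
    """Copy each island's top-k solutions into the next island (ring topology)."""
    for i, island in enumerate(islands):
        islands[(i + 1) % len(islands)].extend(island[:k])

def run_islands(seed_solutions: list[str], n_islands: int = 4, rounds: int = 6):
    # Each island evolves its own population independently...
    islands = [list(seed_solutions) for _ in range(n_islands)]
    for r in range(rounds):
        islands = [evolve_one_generation(isle) for isle in islands]
        # ...and every few rounds, top solutions cross between islands.
        if r % 2 == 1:
            migrate(islands)
    return islands

print(run_islands(["plan A", "plan B"]))
```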

Mind Evolution in planning tasks

The researchers tested Mind Evolution against baselines such as 1-pass, where the model generates only one response; Best-of-N, where the model generates several responses and selects the best one; and Sequential-Revision+, a revision technique where 10 candidate solutions are proposed independently, then revised separately for 80 turns. Sequential-Revision+ is the closest to Mind Evolution, although it lacks the genetic algorithm that combines the best parts of discovered solutions. For reference, the researchers also included an additional 1-pass baseline that uses OpenAI o1-preview.
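
For contrast with Mind Evolution, a rough sketch of the Sequential-Revision+ idea is shown below: each candidate is revised on its own, so improvements discovered in one candidate never propagate to another. `llm_revise` and `evaluate` are hypothetical placeholders, not the paper's code.

```python
import random

# Hypothetical placeholders for the revision model call and the evaluator.
def llm_revise(plan: str, feedback: str) -> str:
    return f"{plan} (revised)"

def evaluate(plan: str) -> tuple[float, str]:
    return random.random(), "example feedback"

def sequential_revisions(initial_plans: list[str], turns: int = 80) -> str:
    """Revise each candidate independently. Unlike Mind Evolution, good ideas
    found in one candidate are never combined with another's."""
    best_plan, best_score = initial_plans[0], float("-inf")
    for plan in initial_plans:          # e.g., 10 independent candidates
        for _ in range(turns):
            score, feedback = evaluate(plan)
            if score > best_score:
                best_plan, best_score = plan, score
            plan = llm_revise(plan, feedback)
    return best_plan

print(sequential_revisions([f"plan {i}" for i in range(10)]))
```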

Performance on the Trip Planning benchmark. As the complexity of the task increases, the gap between Mind Evolution and other methods grows (source: arXiv).

The researchers ran most of the tests on Gemini 1.5 Flash, which is fast and inexpensive. They also explored a two-stage approach, in which the Gemini 1.5 Pro model is used on problems the Flash model fails to solve. This two-stage approach provides better cost-efficiency than using the Pro model for every problem instance.
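
Such a two-stage fallback can be sketched as below; `solve_with` and `passes_evaluation` are hypothetical helpers, and only the model names come from the article.

```python
def solve_with(model: str, problem: str) -> str:
    """Hypothetical helper that runs the full Mind Evolution loop on one model."""
    return f"[{model}] solution for: {problem}"

def passes_evaluation(solution: str) -> bool:
    """Hypothetical check that the fitness evaluator accepts the solution."""
    return False  # pretend the cheap model failed, to demonstrate the fallback

def two_stage_solve(problem: str) -> str:
    # Try the fast, cheap model first; escalate only if it fails.
    solution = solve_with("gemini-1.5-flash", problem)
    if passes_evaluation(solution):
        return solution
    return solve_with("gemini-1.5-pro", problem)

print(two_stage_solve("Schedule five meetings across two days."))
```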

The researchers tested Mind Evolution on several natural language planning benchmarks for tasks such as travel and meeting planning. Previous research has shown that LLMs cannot achieve good performance on these tasks without the help of formal solvers.

For example, Gemini 1.5 Flash and o1-preview achieved success rates of only 5.6% and 11.7%, respectively, on TravelPlanner, a benchmark that simulates organizing a trip plan based on user preferences and constraints expressed in natural language. Even with Best-of-N over 800 independently generated responses, Gemini 1.5 Flash only achieved 55.6% success on TravelPlanner.

TravelPlanner benchmark performance. As the complexity of the task increased, Mind Evolution remained consistently high in performance while other methods failed (source: arXiv).

In all of their tests, Mind Evolution outperformed the baselines by a wide margin, especially as the tasks became more difficult.

For example, Mind Evolution achieved a 95% success rate on TravelPlanner. On the Trip Planning benchmark, which involves creating an itinerary of cities to visit with several days in each, Mind Evolution solved 94.1% of test cases while other methods reached a success rate of 77% at most. Interestingly, the gap between Mind Evolution and other techniques widens as the number of cities grows, showing its ability to handle more complex planning tasks. With the two-stage process, Mind Evolution achieved near-perfect success rates on all benchmarks.

Mind Evolution also proves to be a cost-effective method for solving natural language planning problems, using a fraction of the number of tokens consumed by Sequential-Revision+, the only other technique that comes close to its performance.

“Overall, these results show a clear advantage of an evolutionary strategy that combines a broad search, through stochastic exploration, with a deep search that leverages an LLM for solution refinement,” the researchers wrote.


