Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations




Hallucinations, or inaccurate answers, continue to plague large language models (LLMs). Models are especially weak when they are given more complex tasks and when users are looking for specific and highly detailed answers.

It’s a challenge that data scientists struggle with, and now, researchers from Google DeepMind say they are one step closer to achieving true factuality in foundation models. They introduced FACTS Grounding, a benchmark that evaluates the ability of LLMs to generate factually accurate answers grounded in long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to the prompts.

With the new benchmark, the researchers released a FACTS leaderboard to the Kaggle data science community.

This week, Gemini 2.0 Flash topped the leaderboard, with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.5 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. All of these rank above 61.7% in terms of accuracy.

The researchers say the leaderboard is actively maintained and continuously updated to include new models and their various iterations.

“We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors related to factuality, compared to benchmarks that focus on narrower use cases …” the researchers wrote in a technical paper published this week.

Eliminate inaccurate answers

Ensuring factual accuracy in LLM responses is difficult due to modeling factors (architecture, training and inference) and measurement factors (evaluation methods, data and metrics). Often, the researchers point out, pre-training focuses on predicting the next token given the previous tokens.

“Although this objective can teach the models important knowledge of the world, it does not directly optimize the model for the various factuality scenarios, instead encouraging the model to generate the most compelling text,” the researchers wrote.

To address this, the FACTS dataset includes 1,719 examples – 860 public and 859 private – each requiring long-form responses based on the context of the given documents. Each example includes:

  • A system prompt (system_instruction) with general directives and the command to respond only based on the given context;
  • A task (user_request) that includes a specific question to be answered;
  • A long document (context_document) containing the required information.
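
A minimal sketch of what one such example record might look like, using the field names from the list above; the values here are invented placeholders, not drawn from the actual dataset:

```python
# Hypothetical sketch of a single FACTS Grounding example record.
# Field names (system_instruction, user_request, context_document) come from
# the article; the values are invented placeholders.
example = {
    "system_instruction": (
        "Answer the user's request using only information from the provided "
        "context document. Do not draw on outside knowledge."
    ),
    "user_request": "Summarize the main reasons the company's revenue declined in Q3.",
    "context_document": "<long-form document, up to roughly 32,000 tokens>",
}
```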

To succeed and be labeled “correct,” the model must process the long-form document and produce a subsequent long-form response that is both comprehensive and fully attributable to the document. Responses were marked “inaccurate” if the model’s claims were not directly supported by the document and were not particularly relevant or useful.

For example, a user might ask a model to summarize the main reasons why a company’s revenue decreased in Q3, and provide it with a detailed annual financial report covering quarterly earnings, expenses, planned investments and market analysis.

If the model then returned, say, “The company faced challenges in Q3 that affected its revenue,” it would be deemed inaccurate.

“The response avoids specifying any factors, such as market trends, increased competition or operational failures, which are likely to be present in the document,” the researchers pointed out. “It does not demonstrate an attempt to engage with or extract relevant details.”

Conversely, if a user prompts, “What are some tips for saving money?” and provides a compilation of categorized money-saving tips for college students, a correct answer would be highly detailed: “Take advantage of free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for varied inputs, the researchers included documents of different lengths, up to 32,000 tokens (roughly the equivalent of 20,000 words), covering domains including finance, technology, retail, medicine and law. User requests are similarly broad, spanning Q&A generation and requests for summarization and rewriting.

Each example is judged in two phases. First, responses are evaluated for eligibility: if they do not satisfy the user’s request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the documents provided.

These factuality scores are calculated by three different LLM judges – specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet – which each determine individual scores based on the percentage of accurate model outputs. The final factuality determination is then based on the average of the three judges’ scores.
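
As a rough sketch of how this two-phase judging and score averaging could be wired together – the call_judge helper, its prompts and the use of all three judges for the eligibility check are assumptions for illustration, not DeepMind’s actual implementation:

```python
from statistics import mean

# Judge models named in the article; call_judge below is a hypothetical
# stand-in for real API calls to each of them.
JUDGE_MODELS = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def score_response(example, response, call_judge):
    """Two-phase evaluation of one model response.

    Phase 1 (eligibility): check that the response actually addresses the
    user's request; ineligible responses are disqualified.
    Phase 2 (grounding): each judge estimates how fully the response is
    supported by the context document, and the final factuality score is
    the average of the three judges' scores.
    """
    if not all(call_judge(m, example, response, phase="eligibility")
               for m in JUDGE_MODELS):
        return 0.0  # disqualified: does not satisfy the user's request

    grounding = [call_judge(m, example, response, phase="grounding")
                 for m in JUDGE_MODELS]
    return mean(grounding)

if __name__ == "__main__":
    # Stub judge that approves eligibility and returns a fixed grounding
    # score, just to demonstrate the control flow.
    def stub_judge(model, example, response, phase):
        return True if phase == "eligibility" else 0.9

    print(score_response({"user_request": "..."}, "an answer", stub_judge))
```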

The researchers pointed out that models were often biased toward other members of their own model family – by an average of 3.23% – so a combination of different judges was critical to help ensure that responses were indeed factual.

Finally, the researchers emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, combined with continuous research and development, will continue to improve AI systems,” they wrote.

However, they also acknowledge: “We realize that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”


