AI has not fared well at history, a new paper finds


AI can succeed at certain tasks, like coding or creating a podcast. But it struggles to pass a high-level history exam, a new paper finds.

A group of researchers created a new benchmark to test three leading large language models (LLMs) — OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini — on historical questions. The benchmark, Hist-LLM, tests the correctness of answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy – not much higher than random guessing.
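The paper’s evaluation code isn’t reproduced here, but the setup the article describes — questions about historical facts scored against ground truth from the Seshat databank — can be sketched in a few lines of Python. Everything below, including the query_llm stand-in and the question format, is an illustrative assumption rather than the benchmark’s actual implementation.

```python
import random

# Illustrative sketch of a Hist-LLM-style evaluation loop. This is NOT the
# benchmark's actual code: query_llm() is a hypothetical stand-in for a call
# to GPT-4 Turbo, Llama, or Gemini, and the question format is assumed.

def query_llm(question: str, options: list[str]) -> str:
    """Hypothetical model call; here it simply guesses at random."""
    return random.choice(options)

def evaluate(benchmark: list[dict]) -> float:
    """Fraction of questions whose answer matches the databank's ground truth."""
    correct = sum(
        query_llm(item["question"], item["options"]) == item["ground_truth"]
        for item in benchmark
    )
    return correct / len(benchmark)

# Toy question in the spirit of the examples described in the article.
toy_benchmark = [
    {
        "question": "Did ancient Egypt have a professional standing army in this period?",
        "options": ["yes", "no"],
        "ground_truth": "no",
    },
]
print(f"accuracy: {evaluate(toy_benchmark):.0%}")
```

Accuracy here is simply the share of answers that agree with the databank, which is the figure the article reports as roughly 46% for GPT-4 Turbo.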

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding needed for advanced history. They are good for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they are not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared sample questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs so bad at answering technical history questions when they can be so good at answering far more complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to extrapolate from very prominent historical data, making it difficult for them to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM incorrectly answered that it did. This is likely because there is far more public information about other ancient empires, such as Persia, that did have standing armies.

“If you get told A and B 100 times, and C one time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,” said del Rio-Chanona.

The researchers also identified other trends, including that OpenAI and Llama models performed worse for certain regions, such as sub-Saharan Africa, suggesting potential biases in their training data.

The results show that LLMs still aren’t a substitute for humans in certain domains, said Peter Turchin, who led the study and is a CSH faculty member.

But researchers still hope that LLMs will help historians in the future. They are working to refine their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also highlight the potential for these models to aid historical research,” the paper reads.
