As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are becoming less useful.
That's because even though many LLMs achieve similarly high scores on these benchmarks, it can be difficult to tell which models to use for specific software development and business projects.
A new paper by researchers at Yale University and Tsinghua University presents a novel method to test the ability of models to solve "self-invoking code generation" problems, which require reasoning, generating code, and reusing existing code to solve the problem.
Self-invoking code generation more closely resembles realistic programming scenarios and provides a better understanding of current LLMs' ability to solve real-world coding problems.
Self-invoking code generation
Two popular benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of hand-crafted problems that require the model to write code for simple tasks.
However, these benchmarks cover only a subset of the challenges that software developers face in the real world. In practical scenarios, software developers don't just write new code; they also have to understand and reuse existing code and create reusable components to solve complex problems.
“The ability to understand and then use one’s own generated code, namely self-invoking code generation, plays an important role for LLMs to leverage their reasoning abilities for code generation in a way that is not captured by current benchmarks,” the researchers wrote.
To test the ability of LLMs to solve self-invoking code generation problems, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on top of an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke that solution to solve a more complex problem.
For example, the original problem may be something simple, such as writing a function that replaces all occurrences of a given character in a string with a new character.
The extended problem is to write a function that replaces occurrences of multiple characters in a string with their given replacements. This requires the model to write a new function that calls the function it created for the simpler problem, as in the sketch below.
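As a rough illustration (the function names and signatures here are hypothetical, not drawn from the benchmark itself), such a self-invoking problem pair might look like this in Python:

```python
# Base problem: replace all occurrences of one character in a string.
def replace_char(text: str, old: str, new: str) -> str:
    return text.replace(old, new)


# Extended (self-invoking) problem: replace multiple characters by
# reusing the solution to the base problem.
def replace_chars(text: str, replacements: dict) -> str:
    for old, new in replacements.items():
        text = replace_char(text, old, new)
    return text


# Example usage
print(replace_char("banana", "a", "o"))               # "bonono"
print(replace_chars("banana", {"a": "o", "n": "m"}))  # "bomomo"
```

Solving the extended task correctly requires the model not just to produce new code but to call the code it has already generated, which is exactly the ability the Pro benchmarks are designed to isolate.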
“This evaluation of self-invoking code generation offers deeper insight into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation,” the researchers wrote.
LLMs fall short at self-invoking code generation
The researchers tested HumanEval Pro and MBPP Pro on more than 20 open and private models, including GPT-4o, OpenAI o1-mini, and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek, and Codestral series.
Their findings show a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively use their own generated code for solving more complex problems,” the researchers wrote.
For example, in single-generation tests (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
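For context, pass@1 means the model gets a single attempt per problem and is scored on whether that attempt passes all test cases. Below is a minimal sketch of the standard pass@k estimator commonly used for such scores; the helper is illustrative, not code from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generated samples of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With one sample per problem, pass@1 reduces to the fraction of problems
# whose single generated solution passes all test cases.
outcomes = [True, False, True, True]  # hypothetical per-problem results
print(sum(outcomes) / len(outcomes))  # 0.75
```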
Another interesting finding is that while instruction fine-tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks,” suggesting that we should rethink how we train base models for coding and reasoning tasks.
To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. They then generate candidate solutions and verify their correctness by executing the code and running its test cases. The pipeline minimizes the need for manual code review, helping to generate more examples with less effort.
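A minimal sketch of this kind of pipeline is shown below, assuming a hypothetical generate() callable for the frontier LLM and a simple subprocess-based test runner; neither is the paper's actual implementation.

```python
import subprocess
import sys
import tempfile


def run_tests(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Execute a candidate solution together with its test cases and
    report whether everything passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def build_self_invoking_example(base_problem: str, generate) -> dict | None:
    """Use an LLM to derive a self-invoking problem from a base problem,
    then keep it only if its candidate solution passes the generated tests."""
    new_problem = generate(
        f"Write a harder problem whose solution must reuse the solution to:\n{base_problem}")
    solution = generate(f"Write a Python solution for:\n{new_problem}")
    tests = generate(f"Write assert-based test cases for:\n{new_problem}")
    if run_tests(solution, tests):
        return {"problem": new_problem, "solution": solution, "tests": tests}
    return None  # failed verification; regenerate or fall back to manual review
```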
A complex landscape
This new family of benchmarks comes at a time when old coding benchmarks are quickly being mastered by frontier models. Current frontier models such as GPT-4o, o1, and Claude 3.5 Sonnet already achieve very high scores on HumanEval and MBPP, as well as on their more advanced versions, HumanEval+ and MBPP+.
At the same time, there are more complex benchmarks such as SWE-Bench, which evaluates models on end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance on it. OpenAI o1, for example, performs inconsistently on SWE-Bench Verified.
Self-invoking code generation sits somewhere between simple benchmarks and SWE-Bench. It helps evaluate a specific type of reasoning ability: using existing code within a module to solve complex problems. Self-invoking code benchmarks may prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers are in control and AI copilots help them accomplish specific coding tasks in the software development process.
“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies,” the researchers wrote.