Among our daily and weekly newsletters for the latest updates and exclusive content in the main AI coverage industry. Learn more
Usually, developers focus on lowering time in the inference time – the time between AI receives a prompt and provides an answer – to obtain faster insights.
But when it comes to adversarial strength, Openai researchers say: not very fast. They suggest that increasing time to a model should “think” – Count the calculation time – help build defenses against opponents.
The company uses its own O1-preview and O1-mini models to try this theory, launching various static methods – image-based manipulations, intentionally providing incorrectly Answer math problems, and many models with information (“many- shot jailbreaking”). They immediately measured the success of the attack based on the number of computing the inferences used in the inference.
“We have seen that in many cases, this possibility of decay – always near zero – while calculating time in the inference grows,” the researchers Write a blog post. “Our claim is not that these particular models cannot be broken – we know that this is – but that inference-time-time computing can make good strength for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings for different settings and attack. “
From simple Q / A to complex mathematics
Large language models (LLMS) become more sophisticated and autonomous – in some cases of essence Get computers For people to browse the web, execute the code, make appointments and make other assignments that are autonomy – and while they do, their attacks can be wider and more exposed.
Although the strength of the opponent continues to be a hard problem, with limited progress in solving it, OpenAI researchers are referenced – even if it is more critical as models. create more actions with real-world effects.
“Ensuring modeling agents will work reliably when browsing the web, send emails or uploads to make sure to make sure cars drive self driving without accident , “they wrote to a New Research Paper. “As in case of self-driving cars, an agent passes a wrong email or creates security vulnerability can have many consequences in the real world.”
To test the strength of O1-Mini and O1-Preview, researchers have tested many strategies. First, they investigate the ability of the models to solve simple math problems (basic additions and multiplication) and more complicated from dataset in math (with 12,500 questions from mathematical competitions).
They immediately made “goals” for the enemy: making the output model 42 instead of the right answer; to output the correct answer plus one; or output the correct answer to seven. Using the neural network to graduate, researchers see that increases in “thinking” allowing models to calculate the correct answers.
They also matched the Simpleqa Factuality BenchmarkA dataset of questions that are intended to be difficult to solve models without browsing. Researchers injects adversarial prompts to the web pages browsed by AI and found that, at higher calculation times, they can find conflicts and improve the accuracy of the fact .
Vague nuances
Alternatively, researchers used opponents to contemplate models; Again, more “thinking” weather develops recognition and decrease in error. Finally, they tried the next “misuse prompts” from Strongreject benchmarkDesigned so that the victim’s models need to respond to specific, harmful information. It helps to try to follow the model policy on the content. However, while more time in the inference improved immunity, some prompts have avoided defenses.
Here, researchers call the differences between “unclear” and “unclear” tasks. Math, for example, no doubt is unclear – in every problem x, there is a corresponding fact. However, for more vague tasks such as misuse prompts, “even evaluators often struggle to agree if the output is harmful and / or violates content policies that should be followed by the model, “they are targeted.
For example, if an abusive urge to ask and advise on how to copy not found, it is unclear if an output only gives the overall information about plagiarism in fact enough to support harmful actions.
“In case of unclear tasks, there are settings where the attacker has successfully found ‘lootes,’ and the rate of success is not corrupted by the value calculation time,” agreed the researchers.
Defend against jailbreaking, red-teaming
To make these trials, OpenAI researchers examine different methods of attack.
One is the Many-shot jailbreaking, or exploit the disposition of a model following some shot examples. Enemies “put” the context of many examples, each shows an instance of a successful attack. Models with higher calculation time managed to recognize and simplify it more frequently and successfully.
The soft tokens, on the other hand, allow opponents to maneuver to the vectors of embedding. While increasing inference time has helped here, researchers focus on the need for better protection mechanisms against sophisticated vector-based attacks.
Researchers also made red-teaming attacks on man, with 40 expert testers looking for prompts to get policy violations. Red-teams enforce attacks on five levels of the Inference Time Compute, specifically targeted the erotic and extremist content, poorly behavior and self-harm. To help ensure no biased results, they made blind and randomized testing and also rotated trainers.
In a newer way, researchers make a Language-Model Program (LMP) Adaptive Attack, which imitates the character of Red-team people who trust the ITERATIVE TRIAL AND ERROR. In a process of loop, attackers received feedback on previous failures, then used this info for next tests and quick rephrasing. It continues until they have achieved a successful attack or made 25 changes without any attack.
“Our setup allowed the attacker to match its strategy in many trials, based on defender behavior descriptions in response to each attack,” letter to researchers.
Take advantage of the Time of Inference
In the course of their research, Openai found that attackers are also actively enjoying the time of the inference. One of these methods they call “Think Less” – The opponent’s opponents speak the models of computing, thus increasing their absence of error.
Similarly, they recognize a way of failure to rational models they call “nerd sniping.” As its name suggests, it happens when a model spends a lot of time justification than a task required. With these “outliers” mental chains, the models of essence are trapped in non-productive thinking.
Researchers say: “As ‘think less’ attacks, it is a new method of attack (in) rational models, and one should be considered to ensure that the attackers do not make them notify, or spend their reasoning in non-productive ways. “
Source link