Open-source DeepSeek-R1 uses pure reinforcement learning to match OpenAI o1 — at 95% lower cost




Chinese AI startup DeepSeek, known for challenging leading AI vendors with open-source technologies, just dropped another bombshell: a new open reasoning LLM called DeepSeek-R1.

Based on the recently introduced DeepSeek V3 mixture-of-experts model, DeepSeek-R1 matches the performance of o1, OpenAI’s frontier reasoning LLM, on math, coding and reasoning tasks. The best part? It does so at a far more tempting price, coming in 90-95% cheaper than the latter.

The release marks a major leap forward in the open-source arena. It shows that open models are increasingly closing the gap with closed commercial models in the race to artificial general intelligence (AGI). To show the versatility of its work, DeepSeek also used R1 to distill six Llama and Qwen models, taking their performance to new levels. In one case, the distilled version of Qwen-1.5B outperformed much larger models, GPT-4o and Claude 3.5 Sonnet, in select math benchmarks.

These distilled models, along with the main R1, have been open-sourced and are available on Hugging Face under the MIT license.

What does DeepSeek-R1 bring to the table?

The focus is sharpening on artificial general intelligence (AGI), a level of AI that can perform intellectual tasks like humans. Many teams are doubling down on enhancing models’ reasoning capabilities. OpenAI made the first notable move in the domain with its o1 model, which uses a chain-of-thought reasoning process to tackle a problem. Through RL (reinforcement learning, or reward-driven optimization), o1 learns to hone its chain of thought and refine the strategies it uses — ultimately learning to recognize and correct its mistakes, or try new approaches when the current ones aren’t working.

Now, continuing the work in this direction, DeepSeek has released DeepSeek-R1, which uses a combination of RL and supervised fine-tuning to handle complex reasoning tasks and match the performance of o1.

When tested, the DeepSeek-R1 scored 79.8% on the AIME 2024 math tests and 97.3% on the MATH-500. It also achieved a 2,029 rating on Codeforces — better than 96.3% of human programmers. In contrast, the o1-1217 scored 79.2%, 96.4% and 96.6% respectively in these benchmarks.

It also demonstrated strong general knowledge, with 90.8% accuracy on MMLU, just behind o1’s 91.8%.

Performance of DeepSeek-R1 against OpenAI o1 and o1-mini

The training pipeline

DeepSeek-R1’s reasoning performance marks a big win for the Chinese startup in the US-dominated AI space, especially as the entire work is open-source, including how the company trained the whole thing.

However, the work is not as straightforward as it sounds.

According to the paper describing the research, DeepSeek-R1 was developed as an enhanced version of DeepSeek-R1-Zero — a breakthrough model trained solely from reinforcement learning.

The company first used DeepSeek-V3-base as the base model, developing its reasoning capabilities without employing supervised data, essentially focusing only on its self-evolution through a pure RL-based trial-and-error process. Developed intrinsically from the work, this ability ensures the model can solve increasingly complex reasoning tasks by leveraging extended test-time computation to explore and refine its thought processes in greater depth.

“During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors,” the researchers note in the paper. “After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.”
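The difference between pass@1 and majority voting is worth unpacking: pass@1 scores a single sampled answer, while majority voting samples many answers and keeps the most common one. A minimal sketch of the voting step, using a hypothetical `majority_vote` helper (the paper's exact evaluation harness is not public in this article):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among sampled completions.

    Hypothetical helper for illustration: majority voting (also called
    self-consistency) samples N answers to the same problem and picks
    the one that appears most often.
    """
    return Counter(answers).most_common(1)[0][0]

# Toy example: 5 sampled answers to one math problem
samples = ["72", "65", "72", "72", "81"]
print(majority_vote(samples))  # prints "72"
```

Since reasoning models sample diverse chains of thought, wrong answers tend to scatter while correct ones repeat, which is why the voted score (86.7%) exceeds the single-sample score (71.0%).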

However, despite showing improved performance, including behaviors like reflection and exploration of alternatives, the initial model did display some problems, including poor readability and language mixing. To fix that, the company built on the work done for R1-Zero, using a multi-stage approach combining both supervised learning and reinforcement learning, and thus came up with the enhanced R1 model.

“Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model,” the researchers explained. “Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.”
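The rejection-sampling step in that pipeline can be sketched in miniature: sample several completions from the RL checkpoint, keep only those a verifier accepts, and use the survivors as new SFT pairs. Everything below is a toy stand-in (the hard-coded candidates and exact-match verifier are assumptions for illustration, not DeepSeek's implementation):

```python
def generate_candidates(prompt, n):
    """Stand-in for sampling n completions from the RL checkpoint.

    Toy data: hard-coded candidate answers to '6 * 7'.
    """
    return ["41", "42", "42", "39"][:n]

def is_correct(prompt, completion):
    """Stand-in verifier, e.g. exact match against a reference answer."""
    return completion == "42"

def rejection_sample(prompt, n=16):
    """Keep only verified completions, yielding (prompt, answer) SFT pairs."""
    kept = [c for c in generate_candidates(prompt, n) if is_correct(prompt, c)]
    return [(prompt, c) for c in kept]

# Two of the four toy candidates survive as new training pairs
sft_pairs = rejection_sample("What is 6 * 7?")
print(sft_pairs)  # prints [('What is 6 * 7?', '42'), ('What is 6 * 7?', '42')]
```

The design point is that the model filters its own outputs: only completions that pass verification feed back into the next round of supervised training, so each stage trains on progressively higher-quality reasoning traces.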

Cheaper than o1

In addition to improved performance that nearly matches OpenAI’s o1 across benchmarks, the new DeepSeek-R1 is also very affordable. Specifically, where OpenAI o1 costs $15 per million input tokens and $60 per million output tokens, DeepSeek Reasoner, which is based on the R1 model, costs $0.55 per million input tokens and $2.19 per million output tokens.
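Those per-million-token prices make the savings easy to compute for any workload. A quick sketch, using a hypothetical workload of 10M input and 2M output tokens (the prices are from the figures above; the token counts are invented for illustration):

```python
def api_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in USD given per-million-token input/output prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical workload: 10M input tokens, 2M output tokens
o1_cost = api_cost(10e6, 2e6, 15.00, 60.00)   # $270.00
r1_cost = api_cost(10e6, 2e6, 0.55, 2.19)     # $9.88
savings_pct = (1 - r1_cost / o1_cost) * 100   # about 96% cheaper for this mix
```

The exact percentage shifts with the input/output mix, since the two models discount input and output tokens by different ratios, which is why the headline figure is quoted as a 90-95% range rather than a single number.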

The model can be tested as “DeepThink” on the DeepSeek chat platform, which is similar to ChatGPT. Interested users can access the model weights and code repository via Hugging Face, under an MIT license, or can go with the API for direct integration.


