
Introduction

DeepSeek-AI has open-sourced their reasoning model, R1. Having read the paper and tested the model myself, I'll share my notes on it here.
The team's key contributions include:
  • They apply reinforcement learning (RL) directly to the base model, without supervised fine-tuning (SFT) as a preliminary step. This line of work produced two models: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero demonstrates remarkable reasoning capabilities, including self-verification and reflection. For DeepSeek-R1, they developed a pipeline with two RL stages and two SFT stages to align the model with human preferences and strengthen its non-reasoning capabilities.
  • The team also made a significant discovery: the reasoning capabilities of a larger model can be distilled into smaller models, yielding better performance than applying RL to those small models directly.

R1-Zero (RL Only)

The team built their models on DeepSeek-V3-Base and used Group Relative Policy Optimization (GRPO), which estimates the baseline from a group of sampled outputs rather than training a separate critic model, to keep RL training costs down. Their rule-based reward system has two key components, sketched in code after the list:
  • Accuracy rewards: These measure response correctness.
  • Format rewards: These ensure the model encloses its thinking process within <think> and </think> tags.
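To make the reward design and GRPO's group-relative baseline concrete, here is a minimal sketch in Python. This is my own illustration, not the team's code; the reward weights and the exact answer-matching logic are assumptions.

```python
import re
from statistics import mean, pstdev

THINK_RE = re.compile(r"<think>.*?</think>\s*(.*)", re.DOTALL)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Format reward: reasoning must be wrapped in <think>...</think> tags.
    Accuracy reward: the final answer must match the reference."""
    match = THINK_RE.search(completion)
    if match is None:
        return 0.0                       # malformed output, no reward
    format_reward = 0.5                  # illustrative weight, not from the paper
    answer = match.group(1).strip()
    accuracy_reward = 1.0 if answer == reference_answer else 0.0
    return format_reward + accuracy_reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: normalize each reward against its sampled group,
    so no separate critic (value) model is needed."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Score a group of sampled completions for one prompt, then derive advantages.
group = ["<think>2 + 2 = 4</think> 4", "<think>guess</think> 5", "no tags, just 4"]
rewards = [rule_based_reward(c, "4") for c in group]
print(rewards, group_relative_advantages(rewards))
```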
To train DeepSeek-R1-Zero, the team created a training template as follows:
[Figure: DeepSeek-R1-Zero training template]
This training yields a pass@1 score of 71.0% on AIME 2024, while OpenAI's o1-0912 achieves 74.4%. The detailed results are shown below:
[Figure: detailed benchmark results of DeepSeek-R1-Zero]
It's important to note that using majority voting significantly improves performance—for example, increasing the AIME benchmark score from 71.0% to 86.7%.
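Majority voting (consensus over k samples) is simple to sketch. The answers below are a toy example; in practice the final answers would be extracted from k sampled completions per problem.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer across k sampled completions (cons@k)."""
    return Counter(answers).most_common(1)[0][0]

# Toy usage: the consensus answer is scored instead of a single greedy sample.
sampled_answers = ["42", "42", "41", "42"]
print(majority_vote(sampled_answers))   # -> "42"
```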
A particularly intriguing phenomenon emerged during the training of DeepSeek-R1-Zero: the model learned to allocate more thinking time to problems by reevaluating its initial approaches. Here is an example to illustrate:
[Figure: an example of DeepSeek-R1-Zero re-evaluating its initial approach]
However, applying only RL resulted in model responses with poor readability and mixed language usage. The multi-stage approach was developed to address this problem.

R1 (Multi-Stage Approach)

To prevent instability during the initial RL training phase, the team first collected thousands of long chain-of-thought examples as cold-start data and fine-tuned DeepSeek-V3-Base on them as the starting point.
They then applied the same large-scale reinforcement learning process used in DeepSeek-R1-Zero, adding a language consistency reward to address the language mixing issue.
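The paper describes the language consistency reward as the proportion of target-language words in the chain of thought, added to the reasoning reward. Here is a rough sketch, using a crude ASCII heuristic in place of real language identification:

```python
def language_consistency_reward(cot: str) -> float:
    """Fraction of chain-of-thought tokens that look like the target language
    (English here). A crude ASCII check stands in for real language detection."""
    tokens = cot.split()
    if not tokens:
        return 0.0
    target_like = sum(tok.isascii() for tok in tokens)
    return target_like / len(tokens)

print(language_consistency_reward("First compute 2 + 2, 然后 verify the result."))
```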
After the RL training converged, the team used the resulting checkpoint to collect training data for the next round via rejection sampling: 600,000 reasoning examples and 200,000 non-reasoning examples covering writing, role-playing, and other general tasks, which were then used for another round of supervised fine-tuning.
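Rejection sampling here means generating many candidates per prompt and keeping only those whose final answers check out. A toy sketch, with a stand-in generator:

```python
import random

def generate(prompt: str) -> str:
    """Stand-in for sampling one completion from the converged RL checkpoint."""
    return random.choice(["<think>2 + 2 = 4</think> 4", "<think>hmm</think> 5"])

def rejection_sample(prompt: str, reference: str, n: int = 16) -> list[str]:
    """Keep only completions whose final answer matches the reference; the kept
    traces become SFT data for the next training round."""
    samples = [generate(prompt) for _ in range(n)]
    return [s for s in samples if s.rstrip().endswith(reference)]

kept = rejection_sample("What is 2 + 2?", "4")
print(f"kept {len(kept)} of 16 samples")
```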
In the final stage, they implemented a second reinforcement learning phase using rule-based rewards to enhance mathematical, coding, and logical reasoning abilities. They also incorporated reward models to better align with human preferences in complex scenarios, ultimately improving the model's helpfulness and safety while strengthening its reasoning capabilities.
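The reward routing in this second RL phase might look roughly like the sketch below; both scorers are placeholders I pass in, not DeepSeek's actual reward models.

```python
def stage_reward(sample: dict, rule_based, preference_rm) -> float:
    """Reasoning prompts keep rule-based rewards; general prompts are scored by
    a learned preference reward model for helpfulness and harmlessness."""
    if sample["is_reasoning"]:
        return rule_based(sample["completion"], sample["reference"])
    return preference_rm(sample["prompt"], sample["completion"])

# Toy usage with placeholder scorers.
score = stage_reward(
    {"is_reasoning": True, "completion": "<think>2 + 2 = 4</think> 4",
     "reference": "4", "prompt": "What is 2 + 2?"},
    rule_based=lambda c, ref: float(c.rstrip().endswith(ref)),
    preference_rm=lambda p, c: 0.5,
)
print(score)    # -> 1.0
```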
The results are shown below:
[Figure: DeepSeek-R1 evaluation results]

Distillation

The team directly fine-tuned open-source models such as Qwen and Llama using only SFT with the 800k samples described above, without any RL stage. The benchmark performance proved impressive:
[Figure: benchmark performance of the distilled models]
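For a sense of what the distillation data might look like, here is a toy SFT example built from an R1-style trace; the field names and formatting are my own assumptions, not the actual dataset schema.

```python
# One distilled SFT example: a prompt paired with a full reasoning trace plus
# final answer. Field names and formatting are assumptions, not the real schema.
example = {
    "prompt": "What is the derivative of x**3?",
    "response": "<think>By the power rule, d/dx x^3 = 3x^2.</think> 3x^2",
}

def to_training_text(ex: dict) -> str:
    """Concatenate prompt and response into one SFT training string."""
    return f"User: {ex['prompt']}\nAssistant: {ex['response']}"

print(to_training_text(example))
```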

Unsuccessful Attempts

The team also shared their unsuccessful attempts in the paper: neither Process Reward Models (PRM) nor Monte Carlo Tree Search (MCTS) led to effective reasoning models in their experiments.
 

Insights

  • Small models directly fine-tuned with data distilled from DeepSeek-R1 achieved remarkably impressive performance.
  • MCTS and PRM proved ineffective for building reasoning models.
  • Large-scale RL can improve a model's reasoning capabilities.
 
 