DeepSeek-R1-Zero: Enhanced LLM Reasoning

Bidit Sadhukhan | Jan 24, 2025 | 5 min read

In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for various applications. One of the most exciting developments in this area is the DeepSeek-R1-Zero model, which leverages reinforcement learning (RL) to enhance reasoning capabilities without relying on supervised fine-tuning (SFT). This blog post will guide you through the intricacies of DeepSeek-R1-Zero, from a beginner’s perspective to more advanced technical details.

Introduction to DeepSeek-R1-Zero

DeepSeek-R1-Zero is a groundbreaking model designed to improve the reasoning abilities of LLMs through pure reinforcement learning. Unlike traditional models that require supervised fine-tuning, DeepSeek-R1-Zero skips this step, allowing it to develop reasoning skills autonomously. This approach not only saves computational resources but also provides insights into how models can learn complex problem-solving strategies from scratch.

Understanding Reinforcement Learning in DeepSeek-R1-Zero

What is Reinforcement Learning?

Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. In the context of DeepSeek-R1-Zero, the model learns to generate better reasoning outputs by optimizing a policy model through an RL algorithm called Group Relative Policy Optimization (GRPO).
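
To make the loop concrete, here is a minimal, generic sketch of the agent–environment cycle described above. The `env` and `agent` objects are hypothetical stand-ins, not DeepSeek's actual training code: in DeepSeek-R1-Zero's setting, the "state" is essentially a question, the "action" is a generated answer, and the reward comes from rule-based checks on that answer.

```python
# A generic reinforcement-learning loop (illustrative sketch only).
# `env` and `agent` are hypothetical stand-ins: for DeepSeek-R1-Zero, the
# "state" is a question q, the "action" is a generated answer o, and the
# reward comes from rule-based checks on that answer.

def rl_training_loop(env, agent, num_episodes=1_000):
    for _ in range(num_episodes):
        state = env.reset()                  # e.g. sample a new question
        action = agent.act(state)            # e.g. generate an answer
        reward = env.step(action)            # e.g. score the answer
        agent.update(state, action, reward)  # nudge the policy toward
                                             # higher-reward behaviour
```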

Group Relative Policy Optimization (GRPO)

GRPO is the backbone of DeepSeek-R1-Zero’s learning process. It optimizes the policy model, denoted as π, to generate better reasoning outputs by maximizing an objective function. Here’s a detailed breakdown of how GRPO works:

Objective Function

The objective function, $J_{GRPO}(\theta)$, is defined as:

$$
J_{GRPO}(\theta) = \mathbb{E}_{\,q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)}
\left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}\, A_i,\;
\operatorname{clip}\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right)
- \beta\, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) \right) \right]
$$

Let’s break it down:

  1. Expectation and Sampling: $\mathbb{E}_{\,q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)}$, the expectation over questions drawn from $P(Q)$ and over groups of $G$ outputs sampled from the old policy $\pi_{\theta_{old}}$.

  2. Summation and Minimization: $\frac{1}{G} \sum_{i=1}^{G} \min\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}\, A_i,\; \operatorname{clip}\!\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right)$, the advantage-weighted policy ratio, clipped and averaged over the group.

  3. KL Divergence Term: $\beta\, \mathbb{D}_{KL}\!\left( \pi_\theta(O|q) \,\|\, \pi_{ref}(O|q) \right)$, a penalty that keeps the current policy close to the reference policy.

  • $\mathbb{E}$ is the expectation over the distribution of questions $P(Q)$.
  • $\{o_i\}_{i=1}^{G}$ is a group of outputs sampled from the old policy $\pi_{\theta_{old}}$ given a question $q$.
  • $\pi_\theta(o|q)$ is the probability of output $o$ given question $q$ under the current policy.
  • $A_i$ is the advantage, representing how much better an output is compared to the average output in its group.
  • $\operatorname{clip}(x, 1-\epsilon, 1+\epsilon)$ is a clipping function that limits the policy update to a certain range.
  • $\beta$ is a hyper-parameter, and $\mathbb{D}_{KL}(\pi_\theta \,\|\, \pi_{ref})$ is the Kullback-Leibler divergence, which measures the difference between the current policy and the reference policy $\pi_{ref}$.

Sampling Outputs

For each question $q$, GRPO samples a group of $G$ outputs $o_1, o_2, \ldots, o_G$ from the old policy $\pi_{\theta_{old}}$. This set of outputs allows the algorithm to compare and contrast the different responses the model can generate for a single question.
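
As a rough sketch, the group-sampling step could look like the code below; the helper name `sample_from_old_policy` is hypothetical and stands in for decoding from $\pi_{\theta_{old}}$.

```python
# Illustrative group sampling for GRPO; `sample_from_old_policy` is a
# hypothetical decoding function, not an actual DeepSeek API.
def sample_group(question, sample_from_old_policy, G=16):
    """Draw G candidate outputs o_1, ..., o_G for one question."""
    return [sample_from_old_policy(question) for _ in range(G)]
```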

Advantage Calculation

The advantage $A_i$ is calculated using the rewards $r_1, r_2, \ldots, r_G$ corresponding to the outputs within each group:

$$
A_i = \frac{r_i - \operatorname{mean}(r_1, r_2, \ldots, r_G)}{\operatorname{std}(r_1, r_2, \ldots, r_G)}
$$

  • $r_i$ is the reward associated with a particular output $o_i$.
  • $\operatorname{mean}(r_1, r_2, \ldots, r_G)$ is the average reward of all the outputs in the group.
  • $\operatorname{std}(r_1, r_2, \ldots, r_G)$ is the standard deviation of the rewards within the group.

The advantage quantifies how much better a particular output is compared to the average output in the group, in terms of the reward. It is normalized by the standard deviation to ensure the stability of learning.
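
In code, this normalization is only a few lines; the sketch below assumes the group's rewards have already been computed and uses only the Python standard library.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)  # spread of rewards within the group
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: the third output earns the highest reward, so it gets the
# largest (positive) advantage; the others fall at or below zero.
print(group_advantages([0.0, 0.0, 1.0, 0.5]))
```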

Policy Optimization

The policy model π is optimized by maximizing the GRPO objective function. This objective function combines the advantage with the policy ratio between the new and old policies, and includes a clipping function and KL divergence term to stabilize the training process.
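
The sketch below shows one way this objective can be written with PyTorch, assuming sequence-level log-probabilities for each sampled output are already available. The hyper-parameter values are illustrative defaults rather than the paper's settings, and a real implementation would work at the token level.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Simplified GRPO objective, negated so it can be minimized by gradient descent.

    logp_new, logp_old, logp_ref: log-probabilities of each sampled output under
    the current, old, and reference policies (tensors of shape [G]).
    advantages: group-normalized advantages A_i (shape [G]).
    clip_eps, beta: illustrative values, not the paper's hyper-parameters.
    """
    ratio = torch.exp(logp_new - logp_old)          # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_term = torch.min(unclipped, clipped)

    # Estimator of KL(pi_theta || pi_ref): exp(x) - x - 1 with
    # x = log pi_ref - log pi_theta (always non-negative).
    x = logp_ref - logp_new
    kl = torch.exp(x) - x - 1.0

    # Maximizing the objective is the same as minimizing its negation.
    return -(policy_term - beta * kl).mean()
```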

Reward System

The reward system provides feedback to the model, indicating the quality of its outputs. DeepSeek-R1-Zero uses a rule-based reward system that consists of two types of rewards:

  1. Accuracy Rewards: These rewards are given when the model provides correct answers. For math problems, the final answer must be in a specified format to allow for automated verification. For coding problems, the correctness of the code is checked using a compiler.
  2. Format Rewards: These rewards incentivize the model to structure its responses with the reasoning process enclosed between <think> and </think> tags. This promotes a consistent structure for the model’s outputs, which can help with interpretability (see the sketch below).
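
A toy version of this rule-based reward is sketched below. The exact checks, answer formats, and weights used for DeepSeek-R1-Zero are not spelled out here, so the helper functions and the boxed-answer convention are illustrative assumptions.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses whose reasoning is wrapped in <think> ... </think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Toy accuracy check: compare a \\boxed{...} final answer to the ground truth.

    A real verifier would parse math answers robustly, and coding answers would
    be checked by compiling and running the code against test cases.
    """
    match = re.search(r"\\boxed\{(.+?)\}", response)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Illustrative combination; the actual weighting is an assumption here.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```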

Baseline Estimation

Unlike traditional RL algorithms that utilize a critic model to estimate the baseline, GRPO estimates the baseline directly from group scores. This method avoids the use of a large critic model, saving computational resources during training.

Training and Self-Evolution Process

Training Template

To guide the model, a simple template is used. This template instructs the model to first produce a reasoning process and then provide the final answer. The model is not given any specific instructions on how to reason, allowing researchers to observe its natural development through the RL process.
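
Paraphrasing the kind of template the paper describes (this is a sketch of its structure, not the verbatim prompt), it looks roughly like the following:

```python
# A paraphrase of an R1-Zero-style training template (not the verbatim prompt).
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process and then provides the answer. The reasoning process and answer "
    "are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

prompt = TEMPLATE.format(question="What is 17 * 24?")
print(prompt)
```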

Self-Evolution

As the RL training progresses, DeepSeek-R1-Zero shows consistent improvement in performance. For example, its average pass@1 score on the AIME 2024 benchmark increases from 15.6% to 71.0%. The model learns to spend more time thinking and increases the length of its reasoning process. It develops sophisticated behaviors such as reflection and exploring alternative problem-solving approaches, which emerge through interaction with the RL environment.

Aha Moment

During training, the model experiences an “aha moment” where it learns to rethink its initial approach by allocating more thinking time to the problem. This moment demonstrates the model’s ability to learn advanced problem-solving strategies autonomously through RL.

Key Achievements

  1. Autonomous Learning: DeepSeek-R1-Zero demonstrates that LLMs can develop reasoning skills through pure RL without any supervised data.
  2. Strong Reasoning Capabilities: The model achieves performance levels comparable to OpenAI-o1-0912 on certain benchmarks.
  3. Majority Voting: Using majority voting, its performance on AIME 2024 further improves, exceeding the performance of OpenAI-o1-0912.
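
Majority voting is simple to implement: sample several answers for the same question, extract each final answer, and return the most common one. Here is a minimal sketch, assuming the final answers have already been extracted as strings:

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most frequent final answer among the sampled candidates."""
    return Counter(final_answers).most_common(1)[0][0]

# Example: three of five samples agree on "72", so "72" wins the vote.
print(majority_vote(["72", "68", "72", "72", "70"]))
```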

Limitations

Despite its impressive capabilities, DeepSeek-R1-Zero faces issues like poor readability and language mixing. Its responses may not always be in a human-friendly format and might include multiple languages in one answer.

Conclusion

DeepSeek-R1-Zero is a revolutionary model that learns to reason through pure reinforcement learning, without any prior training on reasoning tasks. It demonstrates the power of RL to enable models to develop complex problem-solving skills and improve their performance over time. While it has some limitations, the insights gained from DeepSeek-R1-Zero pave the way for future advancements in the field of large language models.

For more, you can check out this page: Rejection Sampling in DeepSeek-R1