Understanding Rejection Sampling: From Basics to DeepSeek-R1

When dealing with complex probability distributions, directly sampling from them can be challenging. Rejection sampling is a clever statistical technique that allows us to overcome this hurdle. This method is not only foundational in computational statistics but also a crucial component in advanced models like DeepSeek-R1. In this blog, we’ll start with the basics of rejection sampling, progress to its mathematical framework, and ultimately delve into its application in DeepSeek-R1.

What is Rejection Sampling?

Imagine trying to throw darts at a dartboard with an intricate shape. Instead of directly aiming at the complex outline, you aim at a rectangular board that completely covers it. Each dart that lands within the shape is accepted, while others are rejected. This is the essence of rejection sampling.

In mathematical terms, rejection sampling is a method to generate samples from a complex probability distribution, referred to as the target distribution $π(x)$. To achieve this, we use a simpler proposal distribution $q(x)$ from which we can easily draw samples. Here’s how it works:

1. Sample from the Proposal Distribution: Generate a candidate sample, $x$, from $q(x)$.
2. Generate a Random Threshold: Draw a uniform random number, $u$, from $[0, 1]$.
3. Acceptance Probability: Compare $u$ to the ratio $\frac{π(x)}{M \cdot q(x)}$, where $M$ is a scaling constant ensuring $M \cdot q(x) \geq π(x)$ for all $x$.
   - If $u < \frac{π(x)}{M \cdot q(x)}$, accept $x$.
   - Otherwise, reject $x$ and repeat.

This simple algorithm ensures that the accepted samples follow the target distribution, even though they are initially drawn from the proposal distribution.
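The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the target (an unnormalized Beta(2, 5) density), the Uniform(0, 1) proposal, the bound `M = 0.082`, and the function names are all assumptions chosen for the example, not part of the original text.

```python
import random


def target_unnormalized(x):
    """Unnormalized Beta(2, 5) density on [0, 1]: proportional to x * (1 - x)^4."""
    return x * (1.0 - x) ** 4


def rejection_sample(n_samples, M=0.082, rng=None):
    """Draw n_samples from the target using a Uniform(0, 1) proposal, q(x) = 1.

    M must satisfy M * q(x) >= target(x) on [0, 1]. The target peaks at
    x = 0.2 with value 0.08192, so M = 0.082 is a valid bound.
    """
    rng = rng or random.Random()
    accepted = []
    n_proposed = 0
    while len(accepted) < n_samples:
        x = rng.random()  # step 1: candidate from the proposal q
        u = rng.random()  # step 2: uniform threshold in [0, 1)
        n_proposed += 1
        if u < target_unnormalized(x) / (M * 1.0):  # step 3: accept or reject
            accepted.append(x)
    return accepted, n_proposed


samples, n_proposed = rejection_sample(20_000, rng=random.Random(0))
print(f"sample mean     = {sum(samples) / len(samples):.3f} (Beta(2, 5) mean is 2/7)")
print(f"acceptance rate = {len(samples) / n_proposed:.3f}")
```

Note that the target only needs to be known up to a constant here, because any normalizing factor can be absorbed into $M$.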


Key Components of Rejection Sampling

- Target Distribution $π(x)$: The complex distribution we want to sample from.
- Proposal Distribution $q(x)$: A simpler distribution that is easy to sample from and is nonzero wherever the target is nonzero.
- Scaling Factor $M$: A constant such that $M \cdot q(x)$ covers $π(x)$ everywhere.
- Acceptance Probability: The ratio $\frac{π(x)}{M \cdot q(x)}$, which determines whether a sample is kept based on how well it aligns with the target distribution.

Why Rejection Sampling Works

The magic lies in the acceptance probability. Samples where the target distribution is high relative to the proposal distribution are more likely to be accepted, creating a set of samples that mimic the target distribution.
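To make this precise, here is a short derivation. It assumes that $π(x)$ and $q(x)$ are normalized densities and that $M \cdot q(x) \geq π(x)$ everywhere:

```latex
\begin{aligned}
P(\text{accept} \mid x) &= \frac{\pi(x)}{M\,q(x)}, \\
p(x, \text{accept}) &= q(x)\cdot\frac{\pi(x)}{M\,q(x)} = \frac{\pi(x)}{M}, \\
P(\text{accept}) &= \int \frac{\pi(x)}{M}\,dx = \frac{1}{M}, \\
p(x \mid \text{accept}) &= \frac{\pi(x)/M}{1/M} = \pi(x).
\end{aligned}
```

The proposal density cancels in the second line, so the accepted samples are distributed exactly as $π(x)$.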

Efficiency Considerations

The efficiency of rejection sampling depends on how closely the proposal distribution matches the target distribution. A poor choice of $q(x)$ or an overly large $M$ can lead to many rejections, making the method computationally expensive.
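When both densities are normalized, the overall acceptance rate works out to $1/M$. The sketch below checks this empirically for a few values of $M$. The normalized Beta(2, 5) target, the Uniform(0, 1) proposal, and the helper names are illustrative assumptions.

```python
import random


def beta25_pdf(x):
    """Normalized Beta(2, 5) density: 30 * x * (1 - x)^4 on [0, 1]."""
    return 30.0 * x * (1.0 - x) ** 4


def acceptance_rate(M, n_proposals=50_000, seed=0):
    """Empirical fraction of Uniform(0, 1) proposals accepted for a given M."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_proposals):
        x = rng.random()
        u = rng.random()
        if u < beta25_pdf(x) / M:  # q(x) = 1 for the uniform proposal
            accepted += 1
    return accepted / n_proposals


# The density peaks at about 2.4576, so every M below is a valid bound.
for M in (2.46, 5.0, 10.0):
    print(f"M = {M:5.2f}: empirical = {acceptance_rate(M):.3f}, 1/M = {1 / M:.3f}")
```

The tightest valid bound wastes the fewest proposals. Doubling $M$ roughly doubles the number of candidates needed per accepted sample.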

Bridging to DeepSeek-R1

Rejection sampling also plays an important role in DeepSeek-R1, a large language model trained with reinforcement learning to strengthen its reasoning. Before diving into the specifics, let’s restate the basic ingredients in language-model terms.

In DeepSeek-R1, the goal is to align the generated responses with a reward-maximizing optimal policy. Here’s how rejection sampling is adapted to achieve this:

- Target Distribution $π_{r_ψ}(y | x)$: Represents the optimal policy derived from a reward function $r_ψ$, which scores how well responses align with desired outcomes.
- Proposal Distribution $π_{sft}(y | x)$: Represents a simpler, supervised fine-tuned (SFT) policy that serves as the starting point.

The Rejection Sampling Pipeline in DeepSeek-R1

The process starts with generating response candidates $y$ from the SFT policy $π_{sft}(y | x)$. These candidates are evaluated using the reward function $r_ψ$, and rejection sampling decides which responses to accept based on their alignment with the optimal policy $π_{r_ψ}(y | x)$.

Mathematical Steps

1. Sampling: Generate response candidates from $π_{sft}(y | x)$.
2. Reward Scoring: Use a reward function to assign a score to each candidate.
3. Acceptance Probability: Compute the ratio $\frac{π_{r_ψ}(y | x)}{M \cdot π_{sft}(y | x)}$, where $M$ ensures the scaled proposal distribution covers the target.
4. Acceptance Decision: Accept each candidate with probability equal to this ratio, so that responses better aligned with the target policy are kept more often.
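The ratio in these steps looks hard to evaluate, but it simplifies under one assumption that this post does not state explicitly. The assumption is that the optimal policy has the KL-regularized form used by Liu et al. (2023), with a temperature-like parameter $β > 0$ and a normalizer $Z(x)$. Taking $r_{max}$ as an upper bound on the reward, the derivation can be sketched as:

```latex
\begin{aligned}
\pi_{r_\psi}(y \mid x) &= \frac{1}{Z(x)}\,\pi_{sft}(y \mid x)\,\exp\!\left(\frac{r_\psi(x, y)}{\beta}\right), \\
\frac{\pi_{r_\psi}(y \mid x)}{M\,\pi_{sft}(y \mid x)} &= \frac{1}{M\,Z(x)}\exp\!\left(\frac{r_\psi(x, y)}{\beta}\right), \\
M = \frac{1}{Z(x)}\exp\!\left(\frac{r_{max}}{\beta}\right)
\;&\Longrightarrow\;
\frac{\pi_{r_\psi}(y \mid x)}{M\,\pi_{sft}(y \mid x)} = \exp\!\left(\frac{r_\psi(x, y) - r_{max}}{\beta}\right).
\end{aligned}
```

Under that assumption, neither policy probability nor $Z(x)$ has to be computed. The acceptance probability depends only on the reward of the candidate relative to $r_{max}$.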

Algorithmic Enhancements in DeepSeek-R1

DeepSeek-R1 implements a refined version of rejection sampling to improve efficiency and applicability:

- Pairwise Reward-Ranking Model: Instead of directly calculating $r_ψ$, a ranking model evaluates pairs of responses to derive relative preferences.
- Iterative Sampling: Samples are drawn iteratively, and accepted candidates are excluded from subsequent iterations, enhancing diversity.
- Hyperparameter $β$: A tunable parameter controlling the trade-off between exploiting high-reward samples and exploring diverse responses.
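The iterative scheme could look something like the sketch below. It loosely follows the statistical rejection sampling procedure described by Liu et al. (2023) and is not DeepSeek-R1's actual code. The function name, the toy responses, and the reward values are hypothetical, and the acceptance probability $\exp((r - r_{max}) / β)$ is assumed rather than taken from this post's list of enhancements.

```python
import math
import random


def iterative_rejection_sample(candidates, rewards, num_samples, beta, rng=None):
    """Select up to num_samples distinct candidates, favoring high rewards.

    candidates: list of responses (plain strings in this toy example)
    rewards:    list of scalar rewards, one per candidate
    beta:       temperature-like parameter (> 0)
    """
    rng = rng or random.Random()
    pool = list(zip(candidates, rewards))
    num_samples = min(num_samples, len(pool))
    accepted = []
    while len(accepted) < num_samples:
        # r_max is recomputed over the candidates not yet accepted.
        r_max = max(r for _, r in pool)
        remaining = []
        for response, r in pool:
            accept_prob = math.exp((r - r_max) / beta)
            if len(accepted) < num_samples and rng.random() < accept_prob:
                accepted.append(response)  # accepted items leave the pool
            else:
                remaining.append((response, r))
        pool = remaining
    return accepted


responses = ["a", "b", "c", "d", "e"]
rewards = [0.1, 0.9, 0.4, 0.7, 0.2]
picked = iterative_rejection_sample(
    responses, rewards, num_samples=2, beta=0.1, rng=random.Random(0)
)
print(picked)
```

Because the best remaining candidate always has acceptance probability 1, each pass accepts at least one response and the loop terminates.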

Theoretical Foundation

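Correctness follows from the standard rejection sampling argument. The cost of the method is characterized by its expected acceptance rate:

$E_{y \sim π_{sft}} [\exp((r_ψ(x, y) - r_{max}) / β)],$

where $r_{max}$ is the maximum reward among the candidates not yet accepted. A small $β$ concentrates acceptance on the highest-reward responses, while a large $β$ accepts most candidates and stays close to the SFT policy. By tuning $β$, DeepSeek-R1 strikes a balance between prioritizing high-reward responses and maintaining alignment with the SFT policy.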
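To get a feel for how $β$ affects this quantity, the expectation can be estimated by averaging over a batch of candidate rewards. The sketch below uses synthetic Gaussian rewards as a stand-in for real reward-model scores. Both the data and the helper name are assumptions for illustration.

```python
import math
import random


def estimated_acceptance_rate(rewards, beta):
    """Monte Carlo estimate of E[exp((r - r_max) / beta)] over candidate rewards."""
    r_max = max(rewards)
    return sum(math.exp((r - r_max) / beta) for r in rewards) / len(rewards)


rng = random.Random(0)
# Hypothetical rewards for 1,000 candidates sampled from the SFT policy.
rewards = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for beta in (0.1, 0.5, 1.0, 5.0, 100.0):
    rate = estimated_acceptance_rate(rewards, beta)
    print(f"beta = {beta:6.1f}: estimated acceptance rate = {rate:.4f}")
```

The estimate rises toward 1 as $β$ grows, since every exponent is non-positive and shrinks in magnitude. For small $β$ only candidates very close to $r_{max}$ have a meaningful chance of being accepted.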

Why It Matters

Rejection sampling in DeepSeek-R1 enhances the model’s ability to generate responses that are not only accurate but also aligned with human preferences and rewards. By iteratively refining response selection, the method yields higher-quality outputs than simpler sampling methods.

Final Thoughts

Rejection sampling is a deceptively simple yet powerful tool in the arsenal of computational statistics and machine learning. From sampling complex distributions to optimizing response generation in models like DeepSeek-R1, its applications are vast and impactful. Whether you’re a beginner trying to grasp the basics or a technical expert delving into advanced implementations, understanding rejection sampling is a valuable skill.

FAQs

Can rejection sampling be used for any distribution? In principle yes, provided you can evaluate the target density (at least up to a constant), the proposal is nonzero wherever the target is, and a finite $M$ with $M \cdot q(x) \geq π(x)$ exists.
What are the limitations of rejection sampling? Its efficiency depends heavily on the choice of proposal distribution and scaling factor; a loose bound $M$ means most candidates are rejected.
How does DeepSeek-R1 improve upon standard rejection sampling? By introducing iterative sampling, reward ranking, and a tunable hyperparameter $β$, it tailors the method for language model optimization.

References:

Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P. J., & Liu, J. (2023). Statistical Rejection Sampling Improves Preference Optimization. arXiv.
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., ... Zhang, Z. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv.
