Beginner's tips for LLM post-training

In this post, I'll share my learnings from building basic but powerful post-training pipelines from scratch and improving the quality of a relatively small model (Qwen2.5-Math-1.5B) on math tasks. This was done as part of Assignment 5 of the famous Stanford CS 336. The learnings might be rudimentary to veterans in the field, but they might help those who, like me, have just started on the journey (back) into research.

The key phrase for this practice and learning process is "from scratch". In principle, I could have built on a model I pretrained myself (I have the model implementation and pre-training pipelines, a FlashAttention-2 kernel implemented in Triton, and bucketed, async DDP), but actually running that pretraining would cost too much for my hobby budget. So, following the suggestions from the class, here is my starting point:

  • Qwen2.5-Math-1.5B as the base model.
  • The transformers package from HuggingFace.
  • vLLM for running inference efficiently.
  • Flash-attn package.

All code is shared on GitHub in the following repo:

djwenren/assignment5-alignment
and all experiment logs are here.

Data#

The official assignment for those enrolled in the class uses the MATH data set, but due to copyright reasons, the staff was not able to share it outside the class. They recommended a few alternatives instead. Following their suggestions, I used these two data sets:

  • Countdown. Each example is a list of integers and a target integer. The task is to use the basic arithmetic operations (+, -, ×, /, and parentheses) and the numbers provided to reach the target number. Each input integer needs to be used exactly once, but each operation can be used as many times as needed. There are also benchmarks for various RL algorithms online that one can compare with.
  • GSM8k. A pretty standard data set of grade-school math problems from OpenAI. Questions and answers, including the reasoning process in each answer, are provided. This will be very useful for running SFT.

Because the assignment starter code targets the MATH data set, there is no grader appropriate for these two data sets. It is a good exercise to implement the data processing pipelines, custom graders, and eval pipelines for both.
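As a concrete sketch of such a custom grader, here is a minimal checker for Countdown, assuming a reward of 1.0 for a correct expression and 0.0 otherwise. The function and its names are my own, not from the assignment starter code; it only accepts the four basic operations and parentheses, and checks that each input number is used exactly once.

```python
import ast
import operator
from collections import Counter

# Allowed binary operations for Countdown expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def grade_countdown(expr: str, numbers: list[int], target: int) -> float:
    """Hypothetical grader: 1.0 iff expr uses each number exactly once,
    only +, -, *, / and parentheses, and evaluates to the target."""
    used: list[int] = []

    def ev(node: ast.expr) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            used.append(node.value)  # record which inputs appear
            return node.value
        raise ValueError("disallowed syntax")

    try:
        value = ev(ast.parse(expr, mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0  # malformed or disallowed expression
    if Counter(used) != Counter(numbers):
        return 0.0  # every input number must be used exactly once
    return 1.0 if abs(value - target) < 1e-9 else 0.0
```

Parsing with `ast` rather than calling `eval` keeps the grader safe against arbitrary model output, which matters once this runs inside an RL loop.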

System Prompt#

Though the base model already has some thinking (CoT) capability, it doesn’t seem to follow the output format instruction very well, in particular, the one in the R1 Zero paper:

A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User: {question}
Assistant: <think>

I found it slightly better to spell out the instruction even more explicitly, though the improvement is relatively minor:

You are a helpful Assistant having a conversation with a User. The user asks you a question, and your job is to solve it.
You will first think about the reasoning process step by step and then provide the User with the final answer. Your reasoning process should be enclosed within the XML tags `<think>` and `</think>`, and the final answer within the XML tags `<answer>` and `</answer>`. In other words, format your response in the following way:
```
<think>
Your thinking process goes here
</think>
<answer> Your final answer goes here </answer>
```
Here is the Conversation:
User: {question}
Assistant:
<think>
NOTE

It is possible to force the output format by putting a Finite State Transducer (FST, example intro) constraint on the model response, but I couldn't find an available open-source implementation of this, so I'll have to rely on the model's instruction-following capability for now.

SFT#

For the SFT part, I used the GSM8k data set. The Countdown task isn't suitable for SFT because it provides no ground-truth answer or reasoning process; instead, it gives a deterministic way to verify an answer. This makes it an excellent task for RL training with verifiable rewards, and I'll revisit it in the RL section.

The SFT part is pretty straightforward and similar to pretraining. The only major difference is that the loss is computed on the model response part only, not the whole sequence. See the training pipeline here.
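The response-only loss is commonly implemented by masking the prompt tokens' labels. A minimal sketch with toy token IDs (the helper name is mine; `-100` is the ignore index used by common cross-entropy implementations, including HuggingFace's):

```python
# Labels for prompt positions are set to the ignore index so only
# response tokens contribute to the cross-entropy loss.
IGNORE_INDEX = -100

def build_labels(prompt_ids: list[int],
                 response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and response; mask prompt labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

With `transformers`-style causal LMs, labels aligned position-for-position with `input_ids` like this are shifted internally before the loss is computed, so no manual shift is needed here.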

GSM8K SFT result#

Because the base model is already trained on math tasks, its baseline performance is already at 12.5%. It also learns pretty quickly. Using one A6000, its performance quickly reaches 81.25% within one hour.

GSM8K SFT result

Here is an example where the model answers correctly:

You are a helpful Assistant having a conversation with a User. The user asks you a question, and your job is to solve it.
You will first think about the reasoning process step by step and then provide the User with the final answer. Your reasoning process should be enclosed within the XML tags `<think>` and `</think>`, and the final answer within the XML tags `<answer>` and `</answer>`. In other words, format your response in the following way:
```
<think>
Your thinking process goes here
</think>
<answer> Your final answer goes here </answer>
```
Here is the Conversation:
User: The teacher assigned a minimum of 25 pages of reading for homework. Harrison read 10 more pages than assigned. Pam read 15 more pages than Harrison and Sam read twice the amount of Pam. How many pages did Sam read?
Please give the final answer as a single number.
Assistant:
<think>
The teacher assigned 25 pages, and Harrison read 10 more pages than that, so Harrison read 25+10=<<25+10=35>>35 pages.
Pam read 15 more pages than Harrison, who read 35 pages, so Pam read 15+35=<<15+35=50>>50 pages.
Sam read twice the amount of Pam, who read 50 pages, so Sam read 50*2=<<50*2=100>>100 pages.
</think> <answer> 100.0 </answer>

For experiment details and training curves, see here.

Expert Iteration#

In Expert Iteration, instead of training on ground-truth responses, we train on the model's own responses that lead to correct answers. Schematically, the algorithm is as follows:

\begin{algorithm}
\caption{Expert Iteration (EI)}
\begin{algorithmic}
\INPUT{initial policy model $\pi_{\theta_{\textrm{init}}}$; reward function $R$; task questions $\mathcal{D}$}
\STATE policy model $\pi_{\theta} \leftarrow \pi_{\theta_{\textrm{init}}}$
\FOR{step $ = 1$ \TO $n_{\mathrm{ei\_steps}}$}
    \STATE Sample a batch of questions $\mathcal{D}_b$ from $\mathcal{D}$
    \STATE Set the old policy model $\pi_{\theta_{\mathrm{old}}} \leftarrow \pi_{\theta}$
    \STATE Sample $G$ outputs $\{o^{(i)}\}^G_{i=1} \sim \pi_{\theta_{\mathrm{old}}}(\cdot | q)$ for each question $q \in \mathcal{D}_b$
    \STATE Compute rewards $\{r^{(i)}\}^G_{i=1}$ for each sampled output $o^{(i)}$ by running reward function $R(q, o^{(i)})$
    \STATE Filter out wrong outputs (i.e., $o^{(i)}$ with $r^{(i)} = 0$) to obtain a dataset $\mathcal{D}_{\mathrm{sft}}$ of correct question-response pairs
    \STATE Run SFT $\pi_{\theta} \leftarrow \mathrm{SFT}(\pi_{\theta}, \mathcal{D}_{\mathrm{sft}})$
\ENDFOR
\OUTPUT{$\pi_{\theta}$}
\end{algorithmic}
\end{algorithm}
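The filtering step above (building $\mathcal{D}_{\mathrm{sft}}$ from the sampled outputs) can be sketched in a few lines; the data layout and names here are my own:

```python
# `rollouts` maps each question to its G sampled (response, reward) pairs.
def build_sft_dataset(
    rollouts: dict[str, list[tuple[str, float]]]
) -> list[tuple[str, str]]:
    """Keep only rollouts with nonzero reward as (question, response)
    SFT pairs; wrong outputs (reward 0) are dropped."""
    sft_pairs = []
    for question, outputs in rollouts.items():
        for response, r in outputs:
            if r > 0:
                sft_pairs.append((question, response))
    return sft_pairs
```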

In practice, I noticed the training can suffer from a problem where, within each expert iteration, the model's eval score plateaus quickly after three or four epochs, and after three or four expert iterations, the model performance starts collapsing. My theory is that each expert batch carries a limited amount of useful information, so after three or four epochs the model has extracted most of it and performance plateaus.

In each new expert iteration, the training labels are sampled from a different model than in the previous iterations. The key issue seems to be the optimizer momenta; in particular, I used Adam. The momenta accumulated from previous expert iterations can point in directions inconsistent with the new label distribution, injecting a huge amount of training noise and collapsing the training. My solution is to use a fresh optimizer for each expert iteration. The training then becomes stable, and the eval score curve jumps at the beginning of each expert iteration and then plateaus. For this reason, I set the number of epochs per expert iteration to 4.

I also needed to lower the learning rate to 2e-5 to keep the training stable.

The eval score curve looks as follows:

GSM8K EI result

NOTE

The training was stopped early as a cost-saving measure because the trend was already consistent. SFT and RL with verifiable rewards are more interesting and use compute more efficiently.

For the experiment details, see here. Here is an example of a correct answer:

You are a helpful Assistant having a conversation with a User. The user asks you a question, and your job is to solve it.
You will first think about the reasoning process step by step and then provide the User with the final answer. Your reasoning process should be enclosed within the XML tags `<think>` and `</think>`, and the final answer within the XML tags `<answer>` and `</answer>`. In other words, format your response in the following way:
```
<think>
Your thinking process goes here
</think>
<answer> Your final answer goes here </answer>
```
Here is the Conversation:
User: Frank spent 1/5 of his money to buy groceries. He then spent 1/4 of the remaining money to buy a magazine. If he had $360 left in his wallet, how much money did he have at first?
Please give the final answer as a single number.
Assistant:
<think>
1. Initial amount of money = x
2. After buying groceries, money remaining = (4/5)x
3. After buying magazine, money remaining = (3/4)(4/5)x = (3/5)x
4. Given that (3/5)x = 360
5. Solving for x, we get
x = 360 * (5/3) = 600</think>
<answer> 600 </answer>

RL With Verifiable Reward#

This is the most interesting post-training method discussed in this post. I'll focus on how to get the training working and some of my understanding of the training dynamics, and skip the theory derivations, such as how to go from the policy gradient definition to PPO, DPO, and GRPO. I might write another post on that in the future, but a good starting point is the policy gradient primer in CS 336.

Policy collapse#

This is the main challenge of post-training with RL: the model performance (eval scores) suddenly starts collapsing after initially making good progress, as in the following examples.

Policy collapse with Countdown using GRPO (experiment details). Policy collapse with GSM8K using GRPO (experiment details).

Notice how the policy collapses after spending some time at the performance plateau.

It is typically accompanied by spikes in the KL divergence between the old policy used for sampling rollouts and the latest policy under training, in the gradient norm, and in the clip fraction of the GRPO loss function.

Approximate KL divergence, clip fraction, and gradient norm when policy collapse happens.

My intuition for the collapse is as follows. Initially, there are a lot of easy patterns for the model to learn, so the signal-to-noise ratio is high and learning is relatively stable. Once the model reaches a performance plateau, the meaningful learning signal becomes weaker and weaker, and the noise starts to dominate. If not managed carefully, the noise overwhelms the signal, as demonstrated by the spikes in gradient norm and (approximate) KL divergence.

In the RL training presented here, there are two main sources of noise.

Training goes too off-policy#

Off-policy refers to the training regime where the model/policy being trained differs from the policy used to sample the training examples, a.k.a. the rollout batch. One might wonder why the training needs to go off-policy in the first place; after all, staying on-policy would make the training a lot easier and more stable. I think there are two main reasons:

  • Sampling rollouts is expensive due to the decoding/generation stage of LLM inference. Unlike training or the prefill stage, this stage is almost always memory-bandwidth bound in the hardware and hyperparameter regime I operate in (inference on one H100). We want to make better use of the generated responses.
  • For hard tasks like Countdown, high-reward rollouts (stronger learning signals) are hard to come by initially. Taking multiple training steps on them means better rollout usage and training performance.

But obviously, there are significant disadvantages to off-policy training too:

  • When too off-policy, the ratio between the probabilities under the latest policy and the old policy, $\pi_{\theta} / \pi_{\mathrm{old}}$, can get too big and end up clipped in the GRPO loss function, which leads to no gradient and no training.
  • When too off-policy, this ratio, which is the importance sampling factor, can also get very noisy, which can destabilize and collapse the training.
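To make the clipping behavior concrete, here is a minimal sketch of the per-token clipped surrogate used in PPO-style/GRPO losses; function and parameter names are mine:

```python
import math

def clipped_objective(logp_new: float, logp_old: float, advantage: float,
                      eps: float = 0.2) -> tuple[float, bool]:
    """Return the clipped surrogate objective for one token and whether
    the clip was active (i.e., the unclipped term was discarded)."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling factor
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Pessimistic bound: take the smaller of clipped and unclipped terms.
    objective = min(ratio * advantage, clipped_ratio * advantage)
    was_clipped = objective != ratio * advantage
    return objective, was_clipped
```

When the ratio exceeds $1+\epsilon$ on a positive-advantage token, the objective becomes the clipped constant, so the gradient with respect to the policy vanishes, which is exactly the "no gradient, no training" failure mode described above.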

To address this noise source, one needs to carefully monitor the KL divergence and the clip fraction. I also tried some heuristics for controlling how far off-policy the training can go. For example, I put a threshold on the KL divergence: if the training within one GRPO step reaches that threshold, the step is stopped early. In practice, this didn't help much. Once the training becomes unstable, it almost always takes only one training step to reach the KL threshold and stop, effectively making the training on-policy; and at that point, even on-policy training still collapses.
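The early-stopping heuristic described above can be sketched as follows, using the common "k3" estimator for the approximate KL from per-token log-probs; the threshold value and all names are mine:

```python
import math

def approx_kl(logps_new: list[float], logps_old: list[float]) -> float:
    """k3 estimator of KL(old || new):
    E[exp(logp_old - logp_new) - (logp_old - logp_new) - 1].
    Nonnegative and low-variance compared to the naive estimator."""
    total = 0.0
    for ln, lo in zip(logps_new, logps_old):
        d = lo - ln
        total += math.exp(d) - d - 1.0
    return total / len(logps_new)

def should_stop(logps_new: list[float], logps_old: list[float],
                threshold: float = 0.05) -> bool:
    """Stop the current GRPO step early once the policies diverge."""
    return approx_kl(logps_new, logps_old) > threshold
```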

In the original DeepSeek V3 and R1 papers, there is an extra KL divergence term in the GRPO loss function, controlling how far the policy can drift. However, it seems to have been dropped in most recent models in the literature and in practice. The rationale seems to be that we do want the model to move significantly away from the original model by self-improving with verifiable rewards, and the training can be stabilized by other means, as I show below (tuning epochs, batch sizes, and the learning rate schedule).

In my experiments, the most effective way of avoiding policy collapse from going too off-policy is to control the number of epochs per rollout batch and the number of training steps per epoch. For a simple task like GSM8k, there doesn't seem to be much need to go far off-policy, since the model already has a fairly good success rate. In other words, the training already has plenty of good signal, and there is little need to repeat any particular rollouts. In my experiment, I set 2 epochs per rollout batch and one training step per epoch (using gradient accumulation), and the training progressed smoothly.

GRPO eval reward curve for GSM8K.

The training above starts from the expert iteration checkpoint and took about 35 minutes on two H100s to go from 73% to 82%. No learning rate schedule was needed either. More details about the experiment can be found here. Here is an example question and model response:

You are a helpful Assistant having a conversation with a User. The user asks you a question, and your job is to solve it.
You will first think about the reasoning process step by step and then provide the User with the final answer. Your reasoning process should be enclosed within the XML tags `<think>` and `</think>`, and the final answer within the XML tags `<answer>` and `</answer>`. In other words, format your response in the following way:
```
<think>
Your thinking process goes here
</think>
<answer> Your final answer goes here </answer>
```
Here is the Conversation:
User: In a dance class of 20 students, 20% enrolled in contemporary dance, 25% of the remaining enrolled in jazz dance, and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hop dance?
Please give the final answer as a single number.
Assistant:
<think>
First, calculate the number of students enrolled in contemporary dance:
20% of 20 students = 0.20 * 20 = 4 students
Next, calculate the number of students remaining after accounting for the students in contemporary dance:
20 - 4 = 16 students
Now, calculate the number of students enrolled in jazz dance:
25% of 16 students = 0.25 * 16 = 4 students
Next, calculate the number of students remaining after accounting for the students in jazz dance:
16 - 4 = 12 students
These remaining students are enrolled in hip-hop dance:
12 students
Finally, calculate the percentage of students enrolled in hip-hop dance:
(12 / 20) * 100 = 60%
</think>
<answer> 60 </answer>

As we can see, the model’s CoT is much more clearly explained and logical.

For harder tasks like Countdown, however, going off-policy is necessary. Training performance starts very low (about 0.7%), so taking more training steps on the rare successful examples helps a lot. The price is that the training becomes less stable, especially later in the run when the performance is about to plateau or already has.

Performance already plateaued#

In this stage, most of the gradients the model receives come from noise. If the learning rate is still high, the optimizer can take too big a step in a noise direction, which leads to policy collapse. The most effective control I learned from these experiments is to reduce the learning rate enough to balance the increased noise-to-signal ratio; typically it needs to drop to at most about 10% of the max learning rate. A cosine learning rate schedule worked well for me.
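A sketch of such a schedule, decaying from the max learning rate down to 10% of it over the run (the helper and its defaults are my own, not from any training framework):

```python
import math

def cosine_lr(step: int, total_steps: int, max_lr: float,
              min_ratio: float = 0.1) -> float:
    """Cosine decay from max_lr to min_ratio * max_lr, so late-stage
    steps are small enough not to chase gradient noise."""
    min_lr = max_lr * min_ratio
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Clamping at `total_steps` means any extra steps simply continue at the floor learning rate rather than turning back up.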

Another thing I noticed is that increasing the microbatch size also seems to reduce noise, though I don't have a good intuition for why. Even with a microbatch size of 1 or 2, the training is probably already compute bound, so the benefit of a larger microbatch is probably not a systems effect. On one H100, the maximum microbatch size for Countdown seems to be 4. I started with 2 and found that going to 4 makes the training more stable, but I didn't dig deeper in this direction; it would be interesting to study more in the future.

Here is the learning curve for one RL run on Countdown:

GRPO eval curve for Countdown.

It takes less than an hour on two H100s to go from 0.6% to 51.4%, where it stabilizes. This is comparable to the benchmarks posted by the dataset's creator. Throughout the training run, all training metrics stay stable, including the KL divergence and the gradient norm.

Approximate KL divergence, clip fraction, and gradient norm during stable training.

For more details of this experiment, see here.

Response length and group reward standard deviation normalization#

It was argued in the Dr. GRPO paper that the response length normalization and the group reward standard deviation normalization (z-scaling) make the GRPO loss function no longer an unbiased estimator of the (negative) policy gradient objective. I ran an experiment on Countdown incorporating the paper's suggestions:

  • Instead of normalizing the per-response loss by the model response length, normalize by the max response length.
  • Remove the normalization by the group reward standard deviation.
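The second change can be sketched as two advantage variants (the function and its names are mine): GRPO z-scales the rewards within each group, while the Dr. GRPO variant only subtracts the group mean.

```python
import statistics

def group_advantages(rewards: list[float], normalize_std: bool = True,
                     eps: float = 1e-6) -> list[float]:
    """Per-group advantages for one question's G rollouts.
    normalize_std=True  -> GRPO-style z-scaling;
    normalize_std=False -> Dr. GRPO-style mean-centering only."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = statistics.pstdev(rewards)  # population std over the group
    return [c / (std + eps) for c in centered]
```

The `eps` guard handles groups where every rollout gets the same reward (all correct or all wrong), which would otherwise divide by zero; such groups carry no learning signal either way.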

What I found is that it doesn't make much difference in model performance, with the model trained without the normalizations performing slightly worse. Here is the eval curve comparison:

Eval score comparison between GRPO and Dr. GRPO.

Another interesting observation concerns response length: the model trained without the normalizations (Dr. GRPO) is consistently more verbose. Its average response length stabilizes around 500 tokens, while the one with the normalizations (GRPO) stabilizes around 350 tokens. This makes intuitive sense: in the Dr. GRPO loss function, reward can be improved simply by generating more tokens, while in the GRPO loss function, the reward is averaged over the sequence.

Response length comparison between GRPO and Dr. GRPO.

In real-world scenarios, GRPO would then be much preferred, since it significantly reduces inference cost.

https://www.djwenren.com/posts/beginners-tips-for-llm-post-training/
Author
Danjie Wenren
Published at
2026-01-25
License
CC BY-NC-SA 4.0