DAPO

DAPO (DAPO: An Open-Source LLM Reinforcement Learning System at Scale), judged purely by the form of its objective, can be viewed as a fusion of PPO and GRPO. The PPO optimization objective is: $$\\mathcal{J}_{PPO}(\\theta)=\\mathbb{E}_{(q, a) \\sim \\mathcal{D}, o_{\\leq t} \\sim \\pi_{\\theta_{old