No - DPO avoids a Reinforcement Learning training loop altogether. For the current iteration on verifiable domains, our method is GRPO.
Let me elaborate: DPO is for preference learning - each sample in the dataset contains two pieces, a preferred response and a non-preferred one (what the model should avoid generating), and DPO optimizes the model toward the preferred of the two. That makes DPO an effective method for teaching a model sentiment or preference. We call a generalization of this "alignment mode" - it's on our roadmap.
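To make that concrete, here is a minimal sketch of what a single DPO preference sample looks like (field names are illustrative, not Augento's actual schema):

    # One DPO training sample: a prompt plus a preferred/non-preferred pair.
    # (Illustrative field names, not Augento's schema.)
    dpo_sample = {
        "prompt": "Summarize this support ticket for the engineering team.",
        "chosen": "Login fails for SSO users since the 2.3 release; non-SSO users unaffected.",
        "rejected": "The customer is upset about something related to logging in.",
    }
    # DPO directly raises the likelihood of "chosen" relative to "rejected" -
    # no reward model and no RL rollout loop in between.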
On the current GRPO side, the dataset needs on Augento are simpler: just the prompt, plus some captured context if you like - it's then the reward function that scores the model's generations.
Currently, with GRPO, training is done on verifiable domains: instead of a preference pair, a single generated output is judged by a deterministic reward function or by a reward model - which of the two is up to you, since you decide it by defining the reward function.
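As a rough sketch of what "verifiable" means here (the sample shape and function signature are mine for illustration, not Augento's actual API):

    # A sample only needs a prompt and optional captured context; a
    # deterministic reward function then scores whatever the model generated.
    # (Illustrative names, not Augento's API.)
    def reward(sample: dict, completion: str) -> float:
        expected = str(sample["context"]["expected_answer"])
        # Full credit if the generation contains the verifiable target,
        # nothing otherwise; a reward model could replace this check.
        return 1.0 if expected in completion else 0.0

    sample = {
        "prompt": "What is 17 * 23?",
        "context": {"expected_answer": 391},
    }
    print(reward(sample, "17 * 23 = 391"))  # -> 1.0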
To add, there is an important distinction between RLHF (Reinforcement Learning from Human Feedback) and RL in general. DPO is a simpler and more efficient way to do RLHF. In its current iteration, Augento does RL - what OpenAI has termed Reinforcement Fine-Tuning - which improves model performance on domains where a verification function exists that you can use to score the answer, rather than the preferred/non-preferred pairs that DPO needs.
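To show why only a scoring function is needed, here is a simplified sketch of how GRPO turns those scores into a training signal: it samples a group of completions per prompt and normalizes each reward against the group (a sketch of the published GRPO formulation, not Augento's internals):

    # Simplified GRPO-style advantage: score several completions of the same
    # prompt with the reward function, then normalize within the group.
    # (Sketch of the published formulation, not Augento's implementation.)
    def group_advantages(rewards: list[float]) -> list[float]:
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5
        # Completions scored above the group average get pushed up,
        # those below get pushed down; no preference pairs required.
        return [(r - mean) / (std + 1e-6) for r in rewards]

    print(group_advantages([1.0, 0.0, 0.0, 1.0]))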
But as mentioned, such a preference mode is on the roadmap.
(EDIT: Would you use DPO? Do you have experience with it or needs?)