No - DPO avoids a Reinforcement Learning training loop altogether. For the current iteration on verifiable domains, our method is GRPO.
Let me elaborate: DPO is for preference learning - each sample in the dataset contains two pieces, a preferred response and a non-preferred one (what the model should avoid generating), and DPO optimizes the model toward the preferred of the two. That makes DPO an effective method for teaching a model sentiment or preference. We call a generalization of this "alignment mode" - it's on our roadmap.
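To make that concrete, here is a minimal sketch of what a single DPO preference sample looks like (field names are illustrative, not Augento's actual schema):

    # One DPO training sample: a prompt plus a preferred/non-preferred pair.
    # (Illustrative field names, not Augento's schema.)
    dpo_sample = {
        "prompt": "Summarize this support ticket for the engineering team.",
        "chosen": "Login fails for SSO users since the 2.3 release; non-SSO users unaffected.",
        "rejected": "The customer is upset about something related to logging in.",
    }
    # DPO directly raises the likelihood of "chosen" relative to "rejected" -
    # no reward model and no RL rollout loop in between.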
On the current GRPO side, the dataset needs on Augento are simpler: just the prompt, plus some captured context if you like - it's then the reward function that scores the model's generations.
Currently, with GRPO, training is done on verifiable domains: instead of a preference pair, a single generated output is judged by a deterministic reward function or by a reward model - which of the two is up to you, since you decide it by defining the reward function.
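As a rough sketch of what "verifiable" means here (the sample shape and function signature are mine for illustration, not Augento's actual API):

    # A sample only needs a prompt and optional captured context; a
    # deterministic reward function then scores whatever the model generated.
    # (Illustrative names, not Augento's API.)
    def reward(sample: dict, completion: str) -> float:
        expected = str(sample["context"]["expected_answer"])
        # Full credit if the generation contains the verifiable target,
        # nothing otherwise; a reward model could replace this check.
        return 1.0 if expected in completion else 0.0

    sample = {
        "prompt": "What is 17 * 23?",
        "context": {"expected_answer": 391},
    }
    print(reward(sample, "17 * 23 = 391"))  # -> 1.0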
To add, there is an important distinction between RLHF (Reinforcement Learning from Human Feedback) and RL in general. DPO is a simpler and more efficient way to do RLHF. In its current iteration, Augento does RL - what OpenAI has termed Reinforcement Fine-Tuning - which improves model performance on domains where a verification function exists that you can use to score the answer, rather than the preferred/non-preferred pairs that DPO needs.
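To show why only a scoring function is needed, here is a simplified sketch of how GRPO turns those scores into a training signal: it samples a group of completions per prompt and normalizes each reward against the group (a sketch of the published GRPO formulation, not Augento's internals):

    # Simplified GRPO-style advantage: score several completions of the same
    # prompt with the reward function, then normalize within the group.
    # (Sketch of the published formulation, not Augento's implementation.)
    def group_advantages(rewards: list[float]) -> list[float]:
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5
        # Completions scored above the group average get pushed up,
        # those below get pushed down; no preference pairs required.
        return [(r - mean) / (std + 1e-6) for r in rewards]

    print(group_advantages([1.0, 0.0, 0.0, 1.0]))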
But as mentioned, such a preference mode is on the roadmap.
(EDIT: Would you use DPO? Do you have experience with it or needs?)