
We might not even need RL, as DPO (Direct Preference Optimization) has shown.
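For reference, the DPO objective (Rafailov et al.) replaces the reward model + PPO loop with a logistic loss on log-probability ratios between the chosen and rejected completions. A minimal sketch of the loss, assuming you've already gathered per-completion log-probs from the policy and a frozen reference model (names and the beta value are illustrative):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # log-ratio of policy vs. frozen reference for each completion
        chosen_ratio = policy_chosen_logps - ref_chosen_logps
        rejected_ratio = policy_rejected_logps - ref_rejected_logps
        # push the chosen/rejected margin up, scaled by beta; no rollouts,
        # no separately trained reward network
        logits = beta * (chosen_ratio - rejected_ratio)
        return -F.logsigmoid(logits).mean()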


> if you purely use policy optimization, RLHF will be biased towards short horizons

> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal in a way that SGD smooths incorrectly
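On the "how do you train your preference network?" aside: the usual recipe in RLHF is a Bradley-Terry pairwise loss on human comparison data, roughly as below (a sketch; variable names are illustrative, not from any particular implementation):

    import torch.nn.functional as F

    def preference_model_loss(reward_chosen, reward_rejected):
        # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)
        # maximize log-likelihood of the human preference labels
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()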



