Idk if people noticed, but Mixtral-Instruct was trained with Direct Preference Optimization (DPO). My prediction that a DPO variant will replace RLHF is already coming true https://t.co/tLDIeolSOa
— Nora Belrose (@norabelrose) Jan 14, 2024
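For context on the method the tweet names: DPO fine-tunes a policy directly on preference pairs, without a separate reward model or RL loop, by minimizing a negative log-sigmoid of an implicit reward margin against a frozen reference model. A minimal sketch of that loss for a single preference pair (the function name, inputs, and example numbers here are illustrative, not from any particular implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy being trained and under the frozen
    reference model; beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the reference model, minus the same
    # quantity for the rejected response.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: minimized by widening the
    # policy's gap in favor of the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree, the margin is zero and the loss
# sits at log(2); a positive margin pushes it below that.
print(dpo_loss(-10.0, -14.0, -12.0, -13.0) < math.log(2))  # → True
```

The appeal the tweet alludes to is practical: the whole pipeline reduces to supervised-style gradient steps on this loss, which is far simpler to run than PPO-based RLHF.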
from Twitter https://twitter.com/norabelrose
January 14, 2024 at 06:21PM