Direct Nash Optimization (DNO) better than DPO for RLHF/RLAIF? 🤔 DNO is a new RLHF method from @Microsoft Research using a batched on-policy algorithm that conducts self-improvement iteratively via contrastive learning. 👀 Implementation: 0️⃣ Start with an SFT LLM (☸️), e.g.,… https://t.co/2sSe6mTyHI https://t.co/NLSS6YuFOA
— Philipp Schmid (@_philschmid) Apr 28, 2024
from Twitter https://twitter.com/_philschmid
April 28, 2024 at 08:16AM
via IFTTT
No comments:
Post a Comment