New paper from Salesforce shows that you need to train critique models (for LLM-as-a-judge) on a bunch of different behaviors, and you can do it with DPO. Direct Judge Preference Optimization: also uses a combined SFT + DPO joint loss and uses Llama 3.1 70B to generate high… https://t.co/hnZ4BpEYSO https://t.co/Fh7pMnilKK
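The "combined SFT + DPO joint loss" the tweet mentions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the log-probabilities are assumed to be summed token log-probs of the chosen/rejected critiques under the policy and a frozen reference model, and `beta` and `sft_weight` are hypothetical hyperparameters.

```python
import math

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid of the beta-scaled reward margin,
    where each implicit reward is the policy-vs-reference log-prob gap."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(_sigmoid(beta * margin))

def joint_loss(sft_nll: float,
               policy_chosen_logp: float, policy_rejected_logp: float,
               ref_chosen_logp: float, ref_rejected_logp: float,
               beta: float = 0.1, sft_weight: float = 1.0) -> float:
    """Joint objective: SFT negative log-likelihood on the chosen critique
    plus the DPO preference term (weighting scheme is an assumption here)."""
    return sft_weight * sft_nll + dpo_loss(
        policy_chosen_logp, policy_rejected_logp,
        ref_chosen_logp, ref_rejected_logp, beta)
```

As expected, the DPO term shrinks when the policy puts more mass on the chosen critique relative to the reference than it does on the rejected one.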
— Nathan Lambert (@natolambert) Sep 30, 2024
from Twitter https://twitter.com/natolambert
September 30, 2024 at 04:21PM