Reading a DeepSeek paper and stumbled upon a very beautiful formula where they unify SFT and MOST RL TYPES (DPO, PPO, GRPO, etc.) into ONE FORMULA* (*one that requires additional reward functions to be defined). But the fundamental insight - that all these training methods can be… https://t.co/wSO6UHTySc https://t.co/1ZAFF7Sqz2
— N8 Programs (@N8Programs) Jan 28, 2025
from Twitter https://twitter.com/N8Programs
January 28, 2025 at 05:24AM
via IFTTT
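
For context, a sketch of what such a unified formula can look like. The tweet does not name the paper or show the equation, so this assumes it refers to the "unified paradigm" in the DeepSeekMath paper; the symbols below follow that paper's notation as best recalled and are not taken from the tweet itself.

```latex
% Assumed form of the unified gradient (DeepSeekMath-style notation):
%   \mathcal{D}      data source the (question, output) pairs are drawn from
%   \pi_{rf}         reward function or reward model providing the training signal
%   \mathcal{A}      the algorithm (SFT, DPO, PPO, GRPO, ...)
%   GC_{\mathcal{A}} gradient coefficient: how strongly each token is reinforced or penalized
\nabla_{\theta} \mathcal{J}_{\mathcal{A}}(\theta)
  = \mathbb{E}_{(q,\,o) \sim \mathcal{D}}
    \left[
      \frac{1}{|o|} \sum_{t=1}^{|o|}
      GC_{\mathcal{A}}(q, o, t, \pi_{rf}) \,
      \nabla_{\theta} \log \pi_{\theta}(o_t \mid q, o_{<t})
    \right]
```

Under this reading, SFT is the special case where the data comes from a fixed human-selected dataset and the gradient coefficient is constant at 1, while the RL variants differ mainly in where the data is sampled from and how the reward is turned into the coefficient.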