I read the DeepSeek-R1 paper the day it came out, and I don’t think GRPO is the key to its success. Instead, here’s what truly matters (ranked by importance): 1. Iterative RL and SFT 2. A hybrid reward model—mixing rule-based RM and neural RM for deterministic tasks 3.… https://t.co/navYBL6WhC
— Jiao Sun (@sunjiao123sun_) Jan 28, 2025
from Twitter https://twitter.com/sunjiao123sun_
January 28, 2025 at 01:03AM