Wow, a lot of misconceptions in the replies to @karpathy sensei. Teaching a policy to an agent directly from human feedback is expensive and inefficient, so we came up with a "hack": the reward model (RM), which acts on humans' behalf. That way we only need 1000 human-labeled samples, and the RM can extrapolate to something like 1e8 episodes. https://t.co/YrJWyCxWOm https://t.co/KKTLAU7BHa
— Joey (e/λ) (@shxf0072) Nov 30, 2024
from Twitter https://twitter.com/shxf0072
November 30, 2024 at 12:30PM
via IFTTT
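
To make the reward-model idea in the tweet concrete, here is a minimal sketch, not the method from the linked thread. It assumes a linear reward model trained with a Bradley-Terry pairwise loss, a synthetic "human" whose preferences come from hidden weights (`TRUE_W`), and random 4-dimensional feature vectors standing in for episodes. All of these names and numbers are hypothetical. The only figure taken from the tweet is the 1000 labeled samples; the 10,000 unlabeled comparisons stand in for the much larger 1e8.

```python
import math
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_reward_model(pairs, dim, epochs=20, lr=0.1):
    """Fit a linear reward r(x) = w . x on (preferred, rejected) pairs.

    Uses the Bradley-Terry loss: -log sigmoid(r(preferred) - r(rejected)).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for good, bad in pairs:
            margin = dot(w, good) - dot(w, bad)
            # Gradient step: the loss gradient is -(1 - sigmoid(margin)) * (good - bad).
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                w[i] += lr * g * (good[i] - bad[i])
    return w


def make_episode(rng, dim):
    """Hypothetical stand-in for an episode: a random feature vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def human_label(a, b, true_w):
    """Simulated human: returns (preferred, rejected) using hidden weights."""
    return (a, b) if dot(true_w, a) >= dot(true_w, b) else (b, a)


rng = random.Random(0)
DIM = 4
TRUE_W = [1.0, -2.0, 0.5, 0.0]  # assumption: hidden human preference

# Step 1: the expensive part, 1000 human-labeled comparisons.
labeled = [
    human_label(make_episode(rng, DIM), make_episode(rng, DIM), TRUE_W)
    for _ in range(1000)
]
w = train_reward_model(labeled, DIM)

# Step 2: the cheap part, the RM judges fresh episodes with no human involved.
fresh = [(make_episode(rng, DIM), make_episode(rng, DIM)) for _ in range(10000)]
agree = sum(
    (dot(w, a) >= dot(w, b)) == (dot(TRUE_W, a) >= dot(TRUE_W, b))
    for a, b in fresh
)
agreement = agree / len(fresh)
print(f"RM agrees with the simulated human on {agreement:.1%} of unseen pairs")
```

In an actual RLHF setup the reward model would typically be a neural network scoring model outputs, and its scores would feed a policy-optimization step; the sketch only shows the label-extrapolation part.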