Anthropic just dropped an insane new paper. AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed! Here's everything you need to know: 🧵 https://t.co/hXH0G28qqx
— MatthewBerman (@MatthewBerman) Dec 18, 2024