Anthropic just dropped an insane new paper. AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed! Here's everything you need to know: 🧵 https://t.co/hXH0G28qqx
— MatthewBerman (@MatthewBerman) Dec 18, 2024