You can now distill pretrained Transformers to Mamba / hybrid architecture to get really strong models with fast inference in just a few billion tokens. Beautiful math as always https://t.co/ZQiazasmwg
— Tri Dao (@tri_dao) Aug 22, 2024
from Twitter https://twitter.com/tri_dao
August 22, 2024 at 08:06PM
via IFTTT