We made distillation and spec decoding work with Mamba (and linear RNNs in general)! Up to 300 tok/sec for 7B🚀. Spec dec is nontrivial as there's no KV cache to backtrack if some tokens aren't accepted, but there's an efficient hardware-aware algo to recompute the SSM states https://t.co/hBuyaDm7Ia
— Tri Dao (@tri_dao) Aug 29, 2024
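To make the backtracking problem concrete, here is a minimal sketch in plain Python. It is not the hardware-aware algorithm from the linked work: it uses a toy diagonal linear recurrence, h_t = a * h_{t-1} + b * x_t, and naively recomputes the state from a saved checkpoint over the accepted draft tokens. All function names, the exact-match acceptance rule, and the numeric values are assumptions made for illustration.

```python
from typing import List, Sequence


def step(state: Sequence[float], x: float,
         decay: Sequence[float], gain: Sequence[float]) -> List[float]:
    """One step of a toy diagonal linear recurrence: h_t = a * h_{t-1} + b * x_t."""
    return [a * h + b * x for h, a, b in zip(state, decay, gain)]


def scan(state: Sequence[float], xs: Sequence[float],
         decay: Sequence[float], gain: Sequence[float]) -> List[float]:
    """Run the recurrence over a whole sequence and return the final state."""
    current = list(state)
    for x in xs:
        current = step(current, x, decay, gain)
    return current


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Length of the longest prefix on which draft and target tokens agree.

    Simplification: exact greedy matching stands in for the probabilistic
    acceptance test used in real speculative decoding.
    """
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def recompute_state(checkpoint: Sequence[float], draft: Sequence[int],
                    n_accepted: int, decay: Sequence[float],
                    gain: Sequence[float]) -> List[float]:
    """Rebuild the state from the pre-draft checkpoint using only accepted tokens.

    A recurrent state cannot be truncated the way a KV cache can, so the
    rejected tokens have to be undone by recomputation.
    """
    accepted = [float(tok) for tok in draft[:n_accepted]]
    return scan(checkpoint, accepted, decay, gain)


# Hypothetical values, for illustration only.
decay = [0.5, 0.9]
gain = [1.0, 0.1]
checkpoint = [0.0, 0.0]      # state saved before the draft tokens were processed
draft = [3, 1, 4, 1]         # tokens proposed by the draft model
target = [3, 1, 5, 9]        # tokens the large model would have produced

n_accepted = count_accepted(draft, target)
polluted = scan(checkpoint, [float(t) for t in draft], decay, gain)
clean = recompute_state(checkpoint, draft, n_accepted, decay, gain)
```

Here `polluted` has absorbed the two rejected tokens, while `clean` matches what the recurrence would hold after only the accepted prefix. This sketch makes a second pass over the accepted tokens, which is the cost an efficient recomputation scheme aims to reduce.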
from Twitter https://twitter.com/tri_dao
August 29, 2024 at 12:36AM