Turns out LLMs are paying too much attention: removing the redundant attention layers, roughly half of them in some models, barely affects accuracy while roughly doubling inference speed. Original Problem 🔍: LLMs exhibit… https://t.co/rCHNuImJkp https://t.co/fSeulfDzDi
— Rohan Paul (@rohanpaul_ai) Nov 7, 2024
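To make the idea concrete, here is a minimal sketch (not the referenced work's actual code) of what "removing attention layers" can mean in practice: each decoder block gets a flag that bypasses its attention sub-layer, and the least important half of the blocks have attention disabled. The `Block` class, the `drop_least_important_attention` helper, and the per-layer `importance` scores are all hypothetical stand-ins; a real method would measure each layer's contribution (e.g. by comparing outputs with and without it) rather than using random scores.

```python
# Minimal sketch, assuming a simple pre-norm decoder block.
# Not the paper's implementation; names and scores are placeholders.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.skip_attn = False  # when True, the attention sub-layer is bypassed

    def forward(self, x):
        if not self.skip_attn:
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

def drop_least_important_attention(blocks, importance, keep_ratio=0.5):
    """Bypass attention in the (1 - keep_ratio) fraction of blocks with the
    lowest importance scores (hypothetical scores for illustration)."""
    n_drop = int(len(blocks) * (1 - keep_ratio))
    for i in sorted(range(len(blocks)), key=lambda i: importance[i])[:n_drop]:
        blocks[i].skip_attn = True

blocks = nn.ModuleList(Block() for _ in range(8))
importance = torch.rand(8)          # placeholder per-layer scores
drop_least_important_attention(blocks, importance, keep_ratio=0.5)

x = torch.randn(2, 16, 256)
for blk in blocks:
    x = blk(x)                      # half the blocks now skip attention
```

Since the skipped attention sub-layers are never computed, their FLOPs and KV-cache reads disappear entirely, which is where the speedup in the tweet's claim would come from.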
from Twitter https://twitter.com/rohanpaul_ai
November 07, 2024 at 01:18PM
via IFTTT