Apple presents LazyLLM, which introduces a novel dynamic token pruning method for efficient long-context LLM inference. It can accelerate the prefilling stage of a Llama 2 7B model by 2.34x and maintain high accuracy. Idea: It selectively computes the KV for tokens that are… https://t.co/U4qzNGpvA6 https://t.co/hCWljqGMaO
— elvis (@omarsar0) Jul 22, 2024
from Twitter https://twitter.com/omarsar0
July 22, 2024 at 03:20AM
via IFTTT
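For intuition, here is a minimal sketch of the core idea the tweet describes: during prefill, use attention scores to keep only the prompt tokens that matter most for the next-token prediction, so subsequent layers compute KV over a shorter sequence. This is not Apple's LazyLLM code; the function name `prune_tokens`, the `keep_ratio` parameter, and the toy attention weights are all hypothetical, for illustration only.

```python
# A minimal, illustrative sketch of dynamic token pruning during prefill.
# NOT the LazyLLM implementation; names and the toy attention weights
# below are hypothetical stand-ins for intuition only.
import torch


def prune_tokens(hidden: torch.Tensor, attn_to_last: torch.Tensor,
                 keep_ratio: float = 0.5) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep only the tokens the final prompt position attends to most.

    hidden:       (seq_len, d_model) hidden states entering a layer
    attn_to_last: (seq_len,) attention weights from the last prompt token
    Returns pruned hidden states and the indices of kept tokens, so later
    layers run over a shorter sequence (cheaper KV computation).
    """
    seq_len = hidden.size(0)
    k = max(1, int(seq_len * keep_ratio))
    # Select the top-k most-attended tokens, preserving sequence order.
    keep = torch.topk(attn_to_last, k).indices.sort().values
    return hidden[keep], keep


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, d_model = 16, 8
    hidden = torch.randn(seq_len, d_model)
    # Toy stand-in for the last token's attention over the prompt.
    attn = torch.softmax(torch.randn(seq_len), dim=0)
    pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.25)
    print(f"kept {pruned.size(0)} of {seq_len} tokens:", kept.tolist())
```

In the paper's setting the pruning is dynamic, so tokens dropped at one step can still be revisited later; this sketch only shows the single-pass selection step.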