Recently vLLM added AWQ (4-bit) support... this open-source combination consumes less memory and runs inference faster for Llama-based models like 70B Llama-2 or 34B CodeLlama. I also saw better model quality using AWQ over other quantization methods like GPTQ https://t.co/F7NALZUR86
— anton (@abacaj) Sep 26, 2023
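For context, using AWQ weights in vLLM amounts to passing the quantization flag when the engine is constructed. Here is a minimal sketch using vLLM's offline Python API; the checkpoint name is illustrative, and any Llama-family model published with AWQ (4-bit) weights should work the same way.

```python
# Minimal sketch: run an AWQ-quantized Llama model with vLLM's offline API.
# The model name below is an assumption -- swap in any AWQ Llama checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # assumed community AWQ checkpoint
    quantization="awq",               # tell vLLM the weights are AWQ 4-bit
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Generate completions for a batch of prompts
outputs = llm.generate(["Explain AWQ quantization in one sentence."], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```

For the larger checkpoints mentioned in the tweet (70B Llama-2, 34B CodeLlama) you would typically also set tensor_parallel_size to shard the model across multiple GPUs.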
from Twitter https://twitter.com/abacaj
September 26, 2023 at 01:31PM