[Preprint] Tequila: Trapping-free Ternary Quantization for Large Language Models
This paper proposes Tequila, a trapping-free ternary quantization method for large language models. The key idea of Tequila is to reactivate dead weights by repurposing them as dynamic biases. Tequila achieves a >4% accuracy gain over the SOTA baseline on the ARC benchmark, nearly matching full-precision performance (within a <1% gap). Furthermore, it delivers a 3× inference speedup on an Intel 8263C CPU, confirming that Tequila fully preserves the hardware efficiency of ternary quantization.
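
To make the stated idea concrete, below is a minimal sketch (not the authors' implementation) of ternary quantization in which weights quantized to zero ("dead" weights) are folded into a bias term rather than dropped. The threshold rule, the per-row scale, and the use of a mean activation vector as the calibration statistic for the bias are all assumptions made for illustration; the paper's actual mechanism may differ.

```python
# Hypothetical sketch of "dead weights as a bias": standard ternary quantization,
# plus a bias that recovers the contribution of weights that were quantized to zero.
import numpy as np

def ternary_quantize(W, delta_ratio=0.7):
    """Quantize W to {-1, 0, +1} with a per-row scale (a common ternary scheme)."""
    delta = delta_ratio * np.mean(np.abs(W), axis=1, keepdims=True)   # per-row threshold (assumed rule)
    T = np.sign(W) * (np.abs(W) > delta)                              # ternary codes in {-1, 0, +1}
    nz = np.maximum((T != 0).sum(axis=1, keepdims=True), 1)
    alpha = (np.abs(W) * (T != 0)).sum(axis=1, keepdims=True) / nz    # per-row scale over surviving weights
    return T, alpha

def dead_weight_bias(W, T, x_mean):
    """Fold zero-quantized ("dead") weights into a per-output bias using a
    mean activation vector x_mean (hypothetical calibration statistic)."""
    dead = (T == 0).astype(W.dtype)       # mask of weights trapped at zero
    return (W * dead) @ x_mean            # their expected contribution becomes a bias

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 16)).astype(np.float32)
    x = rng.normal(size=(16,)).astype(np.float32)

    T, alpha = ternary_quantize(W)
    b = dead_weight_bias(W, T, x_mean=x)  # here x stands in for a calibration mean

    y_full = W @ x
    y_ternary = (alpha * T) @ x           # plain ternary matmul drops dead weights entirely
    y_with_bias = y_ternary + b           # adding the bias restores part of their signal

    print("ternary error:  ", np.abs(y_full - y_ternary).mean())
    print("with-bias error:", np.abs(y_full - y_with_bias).mean())
```

In this toy setup the bias exactly recovers the dead weights' contribution for the calibration input, so the remaining error comes only from quantizing the surviving weights; on real activations the bias would only approximate that contribution.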
