Tag: quantization

Boost 2-Bit LLM Accuracy with EoRA

Quantization is one of the key techniques for reducing the memory footprint of large language models (LLMs). It works by converting the data type of model parameters from higher-precision formats such as 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) to lower-precision integer formats, typically INT8 or INT4.…

May 15, 2025
Weighted quantization using MMD: From mean field to mean shift via gradient flows

arXiv:2502.10600v1. Announce Type: new. Abstract: Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a finite weighted mixture of Dirac measures that best approximates…

February 18, 2025
DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations

arXiv:2412.09687v1. Announce Type: cross. Abstract: Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations…

December 16, 2024