🤏 Quantization
Neural networks consist of many parameters, which are typically stored as 32-bit or 16-bit floating-point numbers. Some examples of model sizes:
| Model | # Parameters |
|---|---|
| BARD | 1600 billion |
| GPT-4 | 1500 billion |
| LLaMA | 1200 billion |
| BLOOM | 176 billion |
| LaMDA | 137 billion |
| PaLM-UL2 | 20 billion |
| Mistral 7B | 7 billion |
However, for many applications, it is possible to use lower precision numbers, such as 8-bit or 4-bit integers. This process is called quantization.
For example, a 65B parameter model using FP32 requires about 242GB of memory. With 4-bit quantization, this could be reduced to around 30GB, making it much more practical to use.
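The arithmetic behind those figures is just bytes per parameter times parameter count. A minimal back-of-the-envelope sketch in Python (weights only, ignoring activations and KV cache overhead):

```python
def model_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3

params = 65_000_000_000  # 65B-parameter model

print(f"FP32 : {model_memory_gib(params, 32):.0f} GiB")  # ~242 GiB
print(f"FP16 : {model_memory_gib(params, 16):.0f} GiB")  # ~121 GiB
print(f"4-bit: {model_memory_gib(params, 4):.0f} GiB")   # ~30 GiB
```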
Quantization comes with trade-offs: it reduces memory requirements and improves cache utilization (and therefore throughput), but it also reduces numerical precision and can degrade the model's capabilities.
The overall impact of quantization depends on the specific model and the techniques used to quantize the model.
Quantization is a form of compression. The goal is to reduce the number of bits required to represent a number. This is similar to how we can compress an image by reducing the number of colors used to represent the image.
Quantization maps the model to a lower-precision format, such as 16-bit floating-point numbers or 8-bit integers. Calibration is the step where you figure out the minimum and maximum values of the weights and activations, so that this range can be mapped onto the lower-precision format (see the sketch after the list below).
FP32 = 32-bit floating-point numbers (4,294,967,296 possible values)
INT8 = 8-bit integers (256 possible values)
INT4 = 4-bit integers (16 possible values)
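Here is a minimal sketch of that range calibration for symmetric INT8 quantization, using NumPy. The function names are just illustrative; the key ideas are the scale factor derived from the observed range and the round-trip (quantize, then dequantize) error:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric (absmax) quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0          # calibration: measure the range
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).max()
print(f"max round-trip error: {error:.5f}")
```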
Quantization can be done in two ways:
- Post-Training Quantization (PTQ): The weights and activations are quantized to lower precision after the model has been trained (see the sketch after this list).
- Quantization-Aware Training (QAT): Quantization is simulated during training, so the model learns to compensate for the reduced precision. More computationally intensive, but it can lead to better results.
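As an illustration of the PTQ path, here is a hedged sketch using PyTorch's built-in dynamic quantization, which converts the weights of selected layer types to INT8 after training (the toy model is made up for the example; depending on your PyTorch version the same function may live under `torch.ao.quantization`):

```python
import torch
import torch.nn as nn

# Toy model standing in for an already-trained network (illustrative only).
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
model.eval()

# Post-training dynamic quantization: Linear weights are stored as INT8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights
```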
GGUF (GPT-Generated Unified Format) is a file format, used by llama.cpp and related tools, for storing and running large language models (LLMs) efficiently. The quantization level is usually encoded in the file name:
- Q4: 4-bit quantization
- Q5: 5-bit quantization
- Q8: 8-bit quantization
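For example, loading a quantized GGUF model with the llama-cpp-python bindings might look roughly like this; the file name is a placeholder and the exact options depend on the library version:

```python
from llama_cpp import Llama

# Hypothetical file name; a Q4 GGUF of a 7B model is typically a few GB on disk.
llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```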