Part 15
Completed

Quantize a Model

What I Built

Implemented post-training quantization (PTQ) and quantization-aware training (QAT) for transformers. Explored INT8, INT4, and mixed-precision schemes. Implemented GPTQ, AWQ, and SmoothQuant methods with per-layer and per-channel scaling.
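
A minimal sketch of the round-to-nearest step underneath the INT8 and INT4 schemes mentioned above, assuming symmetric absmax scaling over a small list of weights. The helper names (quantize_absmax, dequantize, mean_abs_error) and the sample values are illustrative assumptions, not the project's actual code. GPTQ and AWQ add error compensation and activation-aware scaling on top of this basic step.

```python
def quantize_absmax(values, bits):
    """Symmetric round-to-nearest quantization of a list of floats.

    Returns (integer codes, scale). Hypothetical helper for illustration.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up weights, used only to compare the two bit widths.
weights = [0.823, -0.417, 0.052, -1.27, 0.336, 0.641, -0.094, 1.108]

codes8, scale8 = quantize_absmax(weights, bits=8)
codes4, scale4 = quantize_absmax(weights, bits=4)

err8 = mean_abs_error(weights, dequantize(codes8, scale8))
err4 = mean_abs_error(weights, dequantize(codes4, scale4))

print(f"INT8 mean abs error: {err8:.5f}")
print(f"INT4 mean abs error: {err4:.5f}")
```

With only 15 usable levels, the INT4 round trip loses noticeably more precision than the INT8 one on the same weights, which is why the 4-bit methods need extra machinery to stay close to FP16.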

Key Concepts

  • Post-Training Quantization
  • Quantization-Aware Training
  • INT8
  • INT4
  • GPTQ
  • AWQ
  • SmoothQuant
  • Mixed Precision

Architecture

1. Quantization Engine
2. GPTQ Implementer
3. AWQ Adapter
4. SmoothQuant Scaler
5. Precision Manager

Results

INT8 quantization retains 99% of FP16 quality with a 2x speedup. INT4 (GPTQ) retains 95% of FP16 quality with a 4x speedup. AWQ protects critical weights effectively.

Key Learnings

  • Not all weights are equally important—protecting outliers is crucial
  • Activation quantization is harder than weight quantization
  • Per-channel scaling significantly improves quantization quality
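
The per-channel point can be illustrated with a small sketch, assuming symmetric INT8 absmax quantization and a toy weight matrix in which one output channel is far larger than the others. All names (absmax_scale, fake_quant, matrix_error) and the values are hypothetical, chosen only for illustration.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    absmax = max(abs(v) for v in values)
    return absmax / qmax if absmax > 0 else 1.0


def fake_quant(values, scale, qmax=127):
    """Quantize to integers with the given scale, then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def matrix_error(A, B):
    """Mean absolute difference between two same-shaped nested lists."""
    flat_a = [a for row in A for a in row]
    flat_b = [b for row in B for b in row]
    return sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)


# Toy weights, one row per output channel. Row 1 dominates the value range.
W = [
    [0.012, -0.034, 0.027, -0.008],
    [2.50, -1.75, 3.10, -0.90],
    [0.051, 0.043, -0.066, 0.019],
]

# Per-tensor: a single scale shared by every row.
tensor_scale = absmax_scale([w for row in W for w in row])
per_tensor = [fake_quant(row, tensor_scale) for row in W]

# Per-channel: each row gets its own scale.
per_channel = [fake_quant(row, absmax_scale(row)) for row in W]

err_tensor = matrix_error(W, per_tensor)
err_channel = matrix_error(W, per_channel)

print(f"per-tensor  mean abs error: {err_tensor:.6f}")
print(f"per-channel mean abs error: {err_channel:.6f}")
```

With a shared scale, the large row sets the step size and the small rows collapse onto a handful of integer levels. Giving each row its own scale leaves the large row unchanged and restores resolution for the small ones.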

Challenges

  • Handling activation outliers that dominate quantization error
  • Balancing speedup with quality degradation
  • Implementing efficient INT4 kernels
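
For the activation-outlier challenge, here is a hedged sketch of SmoothQuant-style smoothing: each input channel j is divided by a factor s_j on the activation side and multiplied by it on the weight side, with s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), so the product X @ W is unchanged while the activation range shrinks. The function names, the toy matrices, and alpha = 0.5 are assumptions for illustration, and the sketch assumes no channel is entirely zero.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth(X, W, alpha=0.5):
    """Return (X_s, W_s, s) such that X_s @ W_s equals X @ W.

    X is tokens x in_channels, W is in_channels x out_channels.
    Assumes every channel has a nonzero activation and weight maximum.
    """
    n_in = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_in)]
    wgt_max = [max(abs(w) for w in W[j]) for j in range(n_in)]
    s = [act_max[j] ** alpha / wgt_max[j] ** (1 - alpha) for j in range(n_in)]
    X_s = [[row[j] / s[j] for j in range(n_in)] for row in X]  # shrink outlier channels
    W_s = [[w * s[j] for w in W[j]] for j in range(n_in)]      # absorb the factor into weights
    return X_s, W_s, s


def abs_max(M):
    return max(abs(v) for row in M for v in row)


# Toy activations: 2 tokens x 3 input channels; channel 1 is the outlier channel.
X = [[0.5, 40.0, -0.8],
     [-0.3, -55.0, 0.6]]
# Toy weights: 3 input channels x 2 output channels.
W = [[0.20, -0.10],
     [0.05, 0.08],
     [-0.30, 0.25]]

X_s, W_s, s = smooth(X, W)
Y = matmul(X, W)
Y_s = matmul(X_s, W_s)

print("activation abs max before:", abs_max(X))
print("activation abs max after: ", round(abs_max(X_s), 3))
print("weight abs max before:    ", abs_max(W))
print("weight abs max after:     ", round(abs_max(W_s), 3))
```

The outlier channel no longer dictates the activation scale, at the cost of a wider weight range. The alpha parameter controls how much of the quantization difficulty is moved from activations to weights.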