Part 9
Completed

FLOPs, Memory, Bandwidth, Precision

What I Built

Built a comprehensive profiling framework analyzing transformer operations across FLOPs, memory footprint, bandwidth utilization, and numerical precision. Studied the impact of mixed precision, quantization, and kernel fusion on end-to-end performance.

Key Concepts

  • FLOP Analysis
  • Memory Footprint
  • Bandwidth Utilization
  • Mixed Precision
  • Kernel Fusion
  • Roofline Model

Architecture

  1. FLOP Counter
  2. Memory Profiler
  3. Bandwidth Monitor
  4. Precision Tester
  5. Roofline Analyzer
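
To make the first stage concrete, here is a minimal sketch of what a FLOP counter for a transformer layer might compute. It is not the framework's actual code: the function names are hypothetical, only matrix multiplications are counted (softmax, masking, normalization, and activations are ignored), and each multiply-add is counted as 2 FLOPs.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def attention_flops(seq_len: int, d_model: int) -> int:
    """Forward FLOPs for one self-attention block on a single sequence.

    Matmuls only. The head count cancels out because
    n_heads * d_head == d_model, so per-head costs sum to the totals below.
    """
    qkv_proj = 3 * matmul_flops(seq_len, d_model, d_model)   # Q, K, V projections
    scores = matmul_flops(seq_len, d_model, seq_len)         # Q @ K^T over all heads
    weighted = matmul_flops(seq_len, seq_len, d_model)       # probs @ V over all heads
    out_proj = matmul_flops(seq_len, d_model, d_model)       # output projection
    return qkv_proj + scores + weighted + out_proj


def ffn_flops(seq_len: int, d_model: int, d_ff: int) -> int:
    """Forward FLOPs for a two-layer feed-forward block (matmuls only)."""
    up = matmul_flops(seq_len, d_model, d_ff)
    down = matmul_flops(seq_len, d_ff, d_model)
    return up + down


if __name__ == "__main__":
    # Illustrative shapes, not measurements from the project.
    print("attention:", attention_flops(seq_len=1024, d_model=512))
    print("ffn:      ", ffn_flops(seq_len=1024, d_model=512, d_ff=2048))
```

Counting only matmuls keeps the formulas easy to check by hand, at the cost of undercounting the elementwise work that tends to matter more for bandwidth than for FLOPs.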

Results

The profiling identified attention as bandwidth-bound and the feed-forward network (FFN) as compute-bound. BF16 training achieved a 1.6x speedup with no quality loss, and INT8 inference ran 3.2x faster.
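
The bandwidth-bound versus compute-bound split can be illustrated with a small roofline calculation. This is a sketch under stated assumptions: the hardware numbers are hypothetical, the shapes are illustrative, all names are made up for this example, and each matmul is assumed to read its operands and write its result to memory exactly once in 2-byte precision.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
        return self.peak_flops / self.mem_bandwidth


def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul with no data reuse
    beyond a single read of each operand and a single write of the result."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(hw: Hardware, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(hw.peak_flops, hw.mem_bandwidth * intensity)


def classify(hw: Hardware, intensity: float) -> str:
    return "compute-bound" if intensity >= hw.ridge_point else "bandwidth-bound"


# Hypothetical accelerator: 300 TFLOP/s and 1.5 TB/s (assumed values).
hw = Hardware(peak_flops=300e12, mem_bandwidth=1.5e12)

# Per-head attention scores Q @ K^T: seq_len=2048, d_head=64 (assumed shapes).
attn_scores = matmul_intensity(2048, 64, 2048)
# FFN up-projection: 8192 tokens, d_model=4096, d_ff=16384 (assumed shapes).
ffn_up = matmul_intensity(8192, 4096, 16384)

if __name__ == "__main__":
    for name, ai in [("attention scores", attn_scores), ("ffn up-proj", ffn_up)]:
        print(f"{name}: {ai:.1f} FLOPs/byte, {classify(hw, ai)}, "
              f"bound {attainable_flops(hw, ai) / 1e12:.0f} TFLOP/s")
```

With these assumed shapes the attention score matmul has a small inner dimension (the head size) and a large output, so its intensity falls below the ridge point, while the FFN projection sits well above it.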

Key Learnings

  • Not all operations are compute-bound; memory bandwidth is often the bottleneck
  • Mixed precision requires careful handling of gradient scaling
  • Roofline model accurately predicts optimization potential
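
The gradient scaling point can be shown with a small simulation of FP16 underflow. This is an illustrative sketch, not the project's training code: it uses Python's `struct` half-precision format to round values to FP16, the helper names are hypothetical, and the gradient value and loss scale are picked for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_grad(true_grad: float, loss_scale: float = 1.0) -> float:
    """Simulate storing a gradient in FP16 with loss scaling.

    The gradient is multiplied by loss_scale before the FP16 round trip
    (as if the loss had been scaled before backprop), then divided by
    loss_scale in full precision, as an optimizer step would do.
    """
    scaled = to_fp16(true_grad * loss_scale)
    return scaled / loss_scale


tiny_grad = 1e-8                          # below the smallest FP16 subnormal
unscaled = fp16_grad(tiny_grad)           # flushed to zero without scaling
rescued = fp16_grad(tiny_grad, 1024.0)    # survives with a loss scale of 1024

if __name__ == "__main__":
    print("without scaling:", unscaled)
    print("with scaling:   ", rescued)
```

The opposite failure also exists: a scale that is too large pushes gradients past the FP16 maximum, which is why practical schemes adjust the scale during training. BF16 keeps the FP32 exponent range, so it is much less exposed to this underflow.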

Challenges

  • Accurate FLOP counting for complex operations
  • Understanding interaction between precision and training stability
  • Modeling memory bandwidth constraints accurately
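
As a rough illustration of the bandwidth modeling challenge, the sketch below estimates memory traffic for a chain of elementwise operations with and without kernel fusion. It is a deliberately simple model, not the framework's method: the function names are hypothetical, every op is treated as unary, caches are ignored, and the bandwidth figure is assumed.

```python
def elementwise_traffic_bytes(num_elems: int, num_ops: int,
                              bytes_per_elem: int = 2,
                              fused: bool = False) -> int:
    """Bytes moved to and from memory by a chain of unary elementwise ops.

    Unfused: every op reads its input tensor and writes its output tensor.
    Fused: the input is read once and only the final output is written.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elems * bytes_per_elem


def min_time_seconds(bytes_moved: int, mem_bandwidth: float) -> float:
    """Lower bound on runtime if memory traffic is the only cost."""
    return bytes_moved / mem_bandwidth


# Illustrative case: 3 elementwise ops over an 8192 x 16384 activation tensor.
num_elems = 8192 * 16384
unfused = elementwise_traffic_bytes(num_elems, num_ops=3)
fused = elementwise_traffic_bytes(num_elems, num_ops=3, fused=True)

if __name__ == "__main__":
    bandwidth = 1.5e12  # bytes/s, assumed
    print("unfused lower bound (ms):", 1e3 * min_time_seconds(unfused, bandwidth))
    print("fused lower bound (ms):  ", 1e3 * min_time_seconds(fused, bandwidth))
```

Real kernels deviate from this bound because of cache effects, extra operands such as biases, and achieved bandwidth that falls below the peak, which is part of why accurate modeling is difficult.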