Quantization routes¶

Quantized inference uses packed affine weights. The storage object is mlx_lattice.core.QuantizedWeight; operations select quantized execution when that object is passed as the weight.

Packed affine layout¶

Each group stores unsigned integer values plus an affine scale and bias. The logical value used by the kernel is:

\[w_{g,j} = s_g q_{g,j} + b_g,\]

where \(q\) is the packed int4/int8 code, \(s_g\) is the group scale, and \(b_g\) is the group bias.

Supported metadata:

Field	Supported values	Validation
`bits`	`4` or `8`	Other bit widths are rejected.
`group_size`	`32`, `64`, or `128`	Storage channels must be divisible by group size.
Packed dtype	`uint32`	Weight tensor is 3D packed storage.
Scale/bias dtype	`float16` or `float32`	Must match each other and feature dtype at execution.
Layout	`linear`, `kernel_major`, `dense_5d`	Determines how logical weights are reconstructed.

Linear route¶

mlx_lattice.ops.linear detects QuantizedWeight and uses quantized MLX matmul for sparse features. The coordinate set is preserved because linear is a feature-only operation.

Convolution route matrix¶

Route	Predicate	Output order
Direct packed relation conv	Valid packed metadata, feature dtype `float16` or `float32`	Public output order.
Sorted quantized implicit-GEMM	fp16 features, sorted plan present, `K=27`, `C_in,C_out` in `{32,64}`, storage channels equal logical channels, group size no larger than `C_in`, TensorOps capability not unavailable	Computes sorted temporary then reorders.
TensorOps quantized contraction	fp16 features, sorted plan, `K=27`, `C_in,C_out` in `{32,64}`, storage channels equal logical channels, `group_size == C_in`, neural acceleration	Public output order.

The direct packed Metal kernels dispatch by feature dtype and bit width:

Feature dtype	int4 kernel	int8 kernel
`float16`	fp16 × int4	fp16 × int8
`float32`	fp32 × int4	fp32 × int8

Performance interpretation¶

Quantization reduces weight storage, but sparse convolution cost is:

\[T \approx T_{\text{relation}} + T_{\text{gather}} + T_{\text{dequant}} + T_{\text{multiply-accumulate}}.\]

If relation traversal dominates, packed weights may not improve runtime. The largest quantization benefit appears when the operation has enough channel work or matrix-like structure for weight bandwidth/arithmetic to matter.

Numerical comparison¶

Packed affine quantization is approximate. Compare quantized convolution to the dequantized dense contract when validating correctness:

\[\hat{Y} = \operatorname{Conv}(X, \operatorname{dequantize}(Q)).\]

The tests use this contract for pointwise, generic relation, submanifold, target, transposed, generative, and sorted implicit-GEMM quantized routes.

Related API and references¶

Need	Page	Notes
Packed weight container	Quantized weights	`QuantizedWeight`, `quantize_weight`, and `dequantize_weight`.
Quantized modules	Quantized convolution modules	Quantized convolution and linear wrappers.
Convolution route matrix	Convolution routes	How quantized weights interact with sparse relation convolution.
Feature linear route	Feature operations	Quantized sparse-feature linear projection.