Quantization routes¶
Quantized inference uses packed affine weights. The storage object is
mlx_lattice.core.QuantizedWeight; operations select quantized execution
when that object is passed as the weight.
Packed affine layout¶
Each group stores unsigned integer values plus an affine scale and bias. The logical value used by the kernel is:
\[ w = s_g \, q + b_g \]
where \(w\) is the reconstructed weight, \(q\) is the packed int4/int8 code, \(s_g\) is the group scale, and \(b_g\) is the group bias.
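To make the layout concrete, here is a minimal pure-Python sketch of one group, assuming the conventional affine form (logical value = group scale × code + group bias). The min/max choice of scale and bias and the low-nibble-first packing order are illustrative assumptions, not the layout `QuantizedWeight` is documented to use, and all function names are hypothetical.

```python
def quantize_group(values, bits):
    """Affine-quantize one group of floats to unsigned codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # assumed min/max scaling
    bias = lo
    codes = [min(levels, max(0, round((v - bias) / scale))) for v in values]
    return codes, scale, bias


def dequantize_group(codes, scale, bias):
    """Reconstruct logical values as scale * code + bias."""
    return [scale * q + bias for q in codes]


def pack_int4(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first (assumed order)."""
    assert len(codes) % 2 == 0
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_int4(packed):
    """Inverse of pack_int4: split each byte back into two 4-bit codes."""
    out = []
    for byte in packed:
        out.extend((byte & 0x0F, byte >> 4))
    return out


group = [-0.5, -0.1, 0.0, 0.25, 0.4, 0.7, 0.9, 1.0]
codes, scale, bias = quantize_group(group, bits=4)
packed = pack_int4(codes)
restored = dequantize_group(unpack_int4(packed), scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

With this scaling the reconstruction error of each value is bounded by half the group scale, which is why smaller groups (and therefore tighter per-group ranges) tend to reduce error at the cost of more scale/bias storage.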
Supported metadata:
| Field | Supported values | Validation |
|---|---|---|
| | | Other bit widths are rejected. |
| | | Storage channels must be divisible by group size. |
| Packed dtype | | Weight tensor is 3D packed storage. |
| Scale/bias dtype | | Must match each other and feature dtype at execution. |
| Layout | | Determines how logical weights are reconstructed. |
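The validation rules in the table can be sketched as a small checker. This is an illustration only: `PackedMeta`, its field names, and `validate` are hypothetical and are not the attributes of `QuantizedWeight`; the supported bit widths are taken from the int4/int8 codes described above, and the group size used in the example is arbitrary.

```python
from dataclasses import dataclass

SUPPORTED_BITS = (4, 8)  # int4 and int8 codes


@dataclass
class PackedMeta:  # hypothetical container, not the library's storage object
    bits: int
    group_size: int
    channels: int
    scale_dtype: str
    bias_dtype: str


def validate(meta, feature_dtype):
    """Return a list of violations; an empty list means the metadata is valid."""
    errors = []
    if meta.bits not in SUPPORTED_BITS:
        errors.append(f"unsupported bit width: {meta.bits}")
    if meta.group_size <= 0 or meta.channels % meta.group_size != 0:
        errors.append("storage channels must be divisible by group size")
    if meta.scale_dtype != meta.bias_dtype:
        errors.append("scale and bias dtypes differ")
    elif meta.scale_dtype != feature_dtype:
        errors.append("scale/bias dtype does not match feature dtype")
    return errors


ok = validate(PackedMeta(4, 64, 128, "float16", "float16"), "float16")
bad = validate(PackedMeta(3, 64, 100, "float16", "float32"), "float16")
```

Collecting all violations instead of raising on the first one is a design choice for the sketch; it makes a single failed check easier to diagnose.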
Linear route¶
mlx_lattice.ops.linear detects QuantizedWeight and uses quantized MLX
matmul for sparse features. The coordinate set is preserved because linear is a
feature-only operation.
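The feature-only property can be illustrated without any library calls. The sketch below is plain Python, not the `mlx_lattice.ops.linear` implementation: `dequantize_row` and `linear` are hypothetical helpers, and each weight row is treated as a single quantization group for brevity.

```python
def dequantize_row(codes, scale, bias):
    """Reconstruct one weight row as scale * code + bias (one group per row)."""
    return [scale * q + bias for q in codes]


def linear(coords, feats, weight_rows):
    """Return (coords, feats @ W^T); only the feature matrix changes."""
    out = [[sum(x * w for x, w in zip(row, w_row)) for w_row in weight_rows]
           for row in feats]
    return coords, out


coords = [(0, 0, 0), (0, 1, 2), (3, 1, 0)]        # one coordinate per active site
feats = [[1.0, 2.0], [0.5, -1.0], [0.0, 4.0]]     # 3 sites x 2 input channels
weight_rows = [                                    # 3 output channels x 2 inputs
    dequantize_row([0, 15], 0.1, -0.5),
    dequantize_row([8, 8], 0.25, -2.0),
    dequantize_row([3, 12], 0.2, 0.0),
]
out_coords, out_feats = linear(coords, feats, weight_rows)
```

The output has the same number of rows as the input and the coordinate list is returned unchanged, so no relation or neighbor lookup is involved.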
Convolution route matrix¶
| Route | Predicate | Output order |
|---|---|---|
| Direct packed relation conv | Valid packed metadata, feature dtype | Public output order. |
| Sorted quantized implicit-GEMM | fp16 features, sorted plan present, | Computes sorted temporary then reorders. |
| TensorOps quantized contraction | fp16 features, sorted plan, | Public output order. |
The direct packed Metal kernels dispatch by feature dtype and bit width:
| Feature dtype | int4 kernel | int8 kernel |
|---|---|---|
| fp16 | fp16 × int4 | fp16 × int8 |
| fp32 | fp32 × int4 | fp32 × int8 |
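The dispatch above amounts to a lookup keyed on feature dtype and bit width. A minimal sketch follows; the dtype strings, the `KERNELS` table, and `select_kernel` are hypothetical names chosen for illustration, and the values are the kernel labels from the table rather than real Metal function names.

```python
# Labels copied from the dispatch table; not actual kernel symbols.
KERNELS = {
    ("fp16", 4): "fp16 × int4",
    ("fp16", 8): "fp16 × int8",
    ("fp32", 4): "fp32 × int4",
    ("fp32", 8): "fp32 × int8",
}


def select_kernel(feature_dtype, bits):
    """Pick the direct packed kernel label, rejecting unsupported combinations."""
    try:
        return KERNELS[(feature_dtype, bits)]
    except KeyError:
        raise ValueError(
            f"no direct packed kernel for {feature_dtype} features with {bits}-bit weights"
        ) from None


chosen = select_kernel("fp16", 4)
```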
Performance interpretation¶
Quantization reduces weight storage, but sparse convolution cost is:
If relation traversal dominates, packed weights may not improve runtime. The largest quantization benefit appears when the operation has enough channel work or matrix-like structure for weight bandwidth/arithmetic to matter.
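The storage side of this trade-off is simple arithmetic. The sketch below estimates packed versus dense weight bytes under stated assumptions: one scale and one bias per group, both stored as 2-byte values, and a dense fp16 baseline. The layer shape (27 kernel offsets, 64 input and 64 output channels) and the group size of 64 are illustrative, and the function names are hypothetical.

```python
def packed_weight_bytes(num_weights, bits, group_size, scale_bias_bytes=2):
    """Bytes for packed codes plus one scale and one bias per group."""
    code_bytes = num_weights * bits // 8
    groups = num_weights // group_size
    return code_bytes + groups * 2 * scale_bias_bytes


def dense_weight_bytes(num_weights, bytes_per_value=2):
    """Bytes for an unquantized weight tensor (fp16 by default)."""
    return num_weights * bytes_per_value


num_weights = 27 * 64 * 64  # assumed 3x3x3 kernel, 64 -> 64 channels
dense = dense_weight_bytes(num_weights)
int4 = packed_weight_bytes(num_weights, bits=4, group_size=64)
int8 = packed_weight_bytes(num_weights, bits=8, group_size=64)
ratio_int4 = dense / int4
```

A roughly 3.6× smaller int4 weight tensor only translates into a speedup when weight traffic is a meaningful share of the total; it does nothing for the cost of walking the relation.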
Numerical comparison¶
Packed affine quantization is approximate. Compare quantized convolution to the dequantized dense contract when validating correctness:
The tests use this contract for pointwise, generic relation, submanifold, target, transposed, generative, and sorted implicit-GEMM quantized routes.
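The comparison described above can be sketched for a single output channel in plain Python. This is not the project's test code: the helper names are hypothetical, the reconstruction is assumed to be scale × code + bias per group, and the tolerance is chosen for double-precision Python floats. A real fp16 kernel would need a looser tolerance.

```python
import math


def dequantize(codes, scales, biases, group_size):
    """Expand packed codes into logical weights, one scale/bias per group."""
    return [scales[i // group_size] * q + biases[i // group_size]
            for i, q in enumerate(codes)]


def dense_dot(x, w):
    """Reference path: ordinary dot product against dequantized weights."""
    return sum(a * b for a, b in zip(x, w))


def quantized_dot(x, codes, scales, biases, group_size):
    """Quantized path: accumulate on codes, apply scale and bias once per group."""
    total = 0.0
    for g in range(len(scales)):
        lo, hi = g * group_size, (g + 1) * group_size
        xs = x[lo:hi]
        code_sum = sum(a * q for a, q in zip(xs, codes[lo:hi]))
        total += scales[g] * code_sum + biases[g] * sum(xs)
    return total


x = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 1.0]
codes = [3, 15, 0, 7, 9, 1, 12, 4]
scales = [0.1, 0.05]
biases = [-0.8, -0.2]

reference = dense_dot(x, dequantize(codes, scales, biases, group_size=4))
actual = quantized_dot(x, codes, scales, biases, group_size=4)
matches = math.isclose(actual, reference, rel_tol=1e-6, abs_tol=1e-6)
```

Comparing against the dequantized weights, rather than the original unquantized ones, separates kernel correctness from quantization error: the two paths should agree up to floating-point rounding even though both differ from the full-precision result.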