Quantized weights

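QuantizedWeight stores packed affine int4/int8 weights for inference. It pairs the packed storage with the metadata needed to execute it, so the logical shape never has to be inferred from the storage shape. Each logical value is reconstructed from its integer code q_{g,j}, the j-th code of group g, using that group's scale s_g and bias b_g: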

\[w_{g,j} = s_g q_{g,j} + b_g.\]
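As an illustration of the affine rule above, here is a minimal pure-Python sketch; it is not the library's kernel. `dequantize_groups` is a hypothetical helper, and it assumes that consecutive codes in a row share a group, so the code at flat position `idx` belongs to group `idx // group_size`.

```python
def dequantize_groups(codes, scales, biases, group_size):
    """Apply w = s_g * q + b_g to one flat row of integer codes."""
    if len(codes) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    values = []
    for idx, q in enumerate(codes):
        g = idx // group_size  # assumed grouping: contiguous runs of group_size codes
        values.append(scales[g] * q + biases[g])
    return values


row = dequantize_groups(
    [0, 1, 2, 3], scales=[0.5, 2.0], biases=[-1.0, 0.0], group_size=2
)
print(row)  # [-1.0, -0.5, 4.0, 6.0]
```

Each group contributes one scale and one bias, so the metadata overhead shrinks as the group size grows, at the cost of a coarser fit per group.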

Supported layouts map to the public operations:

Layout	Logical source shape	Used by
`linear`	`(C_out, C_in)`	Sparse-feature linear projections.
`kernel_major`	`(K, C_in, C_out)`	Relation convolution with mapped kernel rows.
`dense_5d`	`(C_out, Kx, Ky, Kz, C_in)`	Public sparse convolution modules and functions.
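The table above can be read as a mapping from layout name to logical shape. The sketch below spells that mapping out; `logical_shape` is a hypothetical helper, not part of the documented API, and it assumes that K in the `kernel_major` layout equals the kernel volume Kx * Ky * Kz, which this page does not state.

```python
from math import prod


def logical_shape(layout, in_channels, out_channels, kernel_size=(1, 1, 1)):
    """Return the logical source shape that each layout name stands for."""
    kx, ky, kz = kernel_size
    if layout == "linear":
        return (out_channels, in_channels)
    if layout == "kernel_major":
        # Assumption: K is the number of kernel offsets, Kx * Ky * Kz.
        return (prod(kernel_size), in_channels, out_channels)
    if layout == "dense_5d":
        return (out_channels, kx, ky, kz, in_channels)
    raise ValueError(f"unknown layout: {layout!r}")


print(logical_shape("linear", 16, 32))               # (32, 16)
print(logical_shape("dense_5d", 16, 32, (3, 3, 3)))  # (32, 3, 3, 3, 16)
```

Note that the input-channel axis is last for `linear` and `dense_5d` but second for `kernel_major`, which is one reason the layout is recorded explicitly rather than inferred.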

Related pages

Quantized backend routes: Quantization routes
Sparse convolution routes: Convolution routes
Quantized module wrappers: Quantized convolution modules and Quantized feature modules
Feature linear API: Feature operations

class mlx_lattice.core.quantized.QuantizedWeight(weight, scales, biases, group_size, bits, in_channels, out_channels, kernel_size, layout)

Bases: object

Packed affine INT4/INT8 inference weight.

The object stores packed uint32 integer codes plus per-group affine scales and biases. Logical values are reconstructed as scale * code + bias by quantized linear and convolution paths.

layout records the logical source shape: linear for (C_out, C_in), kernel_major for (K, C_in, C_out), and dense_5d for (C_out, Kx, Ky, Kz, C_in).
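To make the packed uint32 storage concrete, the following sketch packs 4-bit or 8-bit codes into 32-bit words and unpacks them again. `pack_codes` and `unpack_codes` are hypothetical helpers, and the bit order (first code in the lowest bits of each word) is an assumption; the order the library actually uses is not documented on this page.

```python
def pack_codes(codes, bits):
    """Pack unsigned integer codes into 32-bit words, lowest bits first."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    per_word = 32 // bits  # 8 codes per word for int4, 4 for int8
    if len(codes) % per_word != 0:
        raise ValueError("number of codes must fill whole 32-bit words")
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for slot, code in enumerate(codes[start:start + per_word]):
            if not 0 <= code <= mask:
                raise ValueError(f"code {code} does not fit in {bits} bits")
            word |= code << (slot * bits)  # assumed order: slot 0 in the low bits
        words.append(word)
    return words


def unpack_codes(words, bits):
    """Inverse of pack_codes under the same assumed bit order."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    return [(word >> (slot * bits)) & mask for word in words for slot in range(per_word)]


packed = pack_codes([1, 2, 3, 4, 5, 6, 7, 8], bits=4)
print([hex(w) for w in packed])  # ['0x87654321']
print(unpack_codes(packed, bits=4))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Under this scheme an int4 weight occupies one eighth of a uint32 word per value and an int8 weight one quarter.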

Parameters:

weight (array)
scales (array)
biases (array)
group_size (int)
bits (int)
in_channels (int)
out_channels (int)
kernel_size (tuple[int, int, int])
layout (Literal['linear', 'kernel_major', 'dense_5d'])

weight: array

scales: array

biases: array

group_size: int

bits: int

in_channels: int

out_channels: int

kernel_size: tuple[int, int, int]

layout: Literal['linear', 'kernel_major', 'dense_5d']

property storage_in_channels: int

property is_pointwise: bool

property nbytes: int

mlx_lattice.core.quantized.dequantize_weight(weight)

Restore the logical floating-point weight represented by weight.

The returned array uses the original logical layout recorded by weight.layout and slices away any padded storage channels.

Parameters:

weight (QuantizedWeight)

Return type:

array

mlx_lattice.core.quantized.quantize_weight(weight, *, group_size=None, bits=4)

Pack a linear or sparse-convolution weight for inference.

Parameters:

weight (array) – Floating-point weight of dtype float16 or float32. Accepted shapes are (C_out, C_in), (K, C_in, C_out), or (C_out, Kx, Ky, Kz, C_in).
group_size (int | None) – Quantization group size. None chooses 64 for C_in >= 64 and 32 otherwise.
bits (int) – Packed integer width, either 4 or 8.
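The parameters above can be illustrated with a small sketch. `default_group_size` follows the documented rule for `group_size=None`. `quantize_group` shows one common way to derive an affine scale and bias for a single group (min/max fitting); this scheme is an assumption for illustration, and the page does not say how the library chooses its scales and biases. Both names are hypothetical.

```python
def default_group_size(in_channels):
    """Documented default: 64 when C_in >= 64, otherwise 32."""
    return 64 if in_channels >= 64 else 32


def quantize_group(values, bits):
    """Min/max affine quantization of one group (assumed scheme, not the library's)."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    levels = (1 << bits) - 1  # 15 for int4, 255 for int8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid a zero scale for constant groups
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo  # the bias is the group minimum


print(default_group_size(48), default_group_size(64))  # 32 64

codes, scale, bias = quantize_group([0.0, 3.2, 7.6, 15.0], bits=4)
print(codes, scale, bias)  # [0, 3, 8, 15] 1.0 0.0
```

With min/max fitting, the reconstruction `scale * code + bias` is within half a scale step of each original value, so larger `bits` or smaller groups reduce the error.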

Return type:

QuantizedWeight

Returns:

A QuantizedWeight containing packed storage and affine metadata. When needed, input channels are padded in storage to a multiple of the selected group size; logical in_channels remains the original channel count.
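The padding behaviour described above can be sketched as follows. `padded_in_channels`, `pad_row`, and `slice_row` are hypothetical helpers; the sketch assumes that storage rounds the channel count up to the next multiple of the group size and fills the extra channels with zeros, neither of which is confirmed on this page.

```python
def padded_in_channels(in_channels, group_size):
    """Round in_channels up to the next multiple of group_size (assumed rule)."""
    return -(-in_channels // group_size) * group_size  # ceiling division


def pad_row(row, group_size, fill=0.0):
    """Extend one input-channel row to the padded storage width."""
    target = padded_in_channels(len(row), group_size)
    return list(row) + [fill] * (target - len(row))


def slice_row(stored_row, in_channels):
    """Drop padded channels, as dequantize_weight is described as doing."""
    return stored_row[:in_channels]


print(padded_in_channels(48, 32))   # 64
print(padded_in_channels(100, 64))  # 128

original = [0.25] * 48
stored = pad_row(original, group_size=32)
print(len(stored))  # 64
print(slice_row(stored, len(original)) == original)  # True
```

Keeping the logical `in_channels` separate from the storage width is what lets the slice recover the original shape exactly.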