openvinotoolkit/nncf

[Torch FX] INT4 data-free weights compression


🚀 Feature request

INT4 weight compression is widely used to compress LLMs and optimize model inference. OpenVINO efficiently executes models with INT4 weights, which results in significantly faster inference.

The feature request is to add INT4 weight compression for torch.fx.GraphModule models in nncf.compress_weights, so that models with INT4-compressed weights can be created and then run via torch.compile with the OpenVINO backend.

Feature Use Case

import torch
from torch._export import capture_pre_autograd_graph

import nncf

# initialize a floating point model (M is a placeholder nn.Module)
float_model = M().eval()

# program capture
# NOTE: this API will be updated to the torch.export API in the future, but the captured result should mostly stay the same
model = capture_pre_autograd_graph(float_model, *example_inputs)

# compress weights to INT4
compressed_model = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_ASYM)

# compile the compressed model with the OpenVINO backend
compiled_model = torch.compile(compressed_model, backend='openvino')
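
For illustration, nncf.compress_weights already exposes data-free INT4 knobs for OpenVINO models, notably ratio and group_size; the sketch below assumes the same parameters would apply to the Torch FX path once this feature lands, and is not part of the original request.

# a minimal sketch, assuming the Torch FX path accepts the INT4 options
# nncf.compress_weights already offers for OpenVINO models:
#   ratio      - fraction of weights compressed to INT4 (the rest stay INT8)
#   group_size - number of weight elements sharing one quantization scale
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    group_size=128,
)

# inference with the OpenVINO backend stays unchanged
compiled_model = torch.compile(compressed_model, backend='openvino')

Group-wise scales are what distinguish the INT4 modes from the default per-channel INT8 compression, which is why group_size matters for accuracy here.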

Are you going to submit a PR?

  • Yes, I'd like to help by submitting a PR!