openvinotoolkit/nncf

[Torch FX] INT4 data-free weights compression


🚀 Feature request

INT4 weight compression is widely used to compress LLMs and optimize model inference. OpenVINO efficiently executes models with INT4 weights, which results in significantly faster inference.

The feature request is to add INT4 weight compression for torch.fx.GraphModule models in nncf.compress_weights, so that models with INT4-compressed weights can be created and then run via torch.compile with the OpenVINO backend.

Feature Use Case

import torch
from torch._export import capture_pre_autograd_graph

import nncf

# initialize a floating point model (M is a placeholder nn.Module)
float_model = M().eval()

# program capture
# NOTE: this API will be updated to the torch.export API in the future, but the captured result should mostly stay the same
model = capture_pre_autograd_graph(float_model, *example_inputs)

# compress weights to INT4
compressed_model = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_ASYM)

# compile the compressed model with the OpenVINO backend
compiled_model = torch.compile(compressed_model, backend='openvino')
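
For illustration, nncf.compress_weights already exposes data-free INT4 knobs for OpenVINO models, notably ratio and group_size; the sketch below assumes the same parameters would apply to the Torch FX path once this feature lands, and is not part of the original request.

# a minimal sketch, assuming the Torch FX path accepts the INT4 options
# nncf.compress_weights already offers for OpenVINO models:
#   ratio      - fraction of weights compressed to INT4 (the rest stay INT8)
#   group_size - number of weight elements sharing one quantization scale
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    group_size=128,
)

# inference with the OpenVINO backend stays unchanged
compiled_model = torch.compile(compressed_model, backend='openvino')

Group-wise scales are what distinguish the INT4 modes from the default per-channel INT8 compression, which is why group_size matters for accuracy here.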

Are you going to submit a PR?

  • Yes, I'd like to help by submitting a PR!