
Cannot run int8 inference with the quantized model on my device

Opened this issue · 1 comment

I want to test the accuracy and inference time of I-BERT in int8, so I installed transformers and quantized the roberta-base model to generate the weights. I have set quant_mode to true and torch_dtype to int8. However, the int8 inference time of I-BERT is about the same as that of the roberta-base model on my GTX 1080 Ti. Is there a problem with my config.json or my device? Here is my config.json:

{
  "_name_or_path": "./outputs/checkpoint-1150/",
  "architectures": [ ... ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": 0,
    "1": 1
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "0": 0,
    "1": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "torch_dtype": "int8",
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}

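To make the quantization-related settings in the config.json above easier to check, here is a minimal sketch that extracts just those fields. It uses only the standard library; `quantization_settings` and `QUANT_KEYS` are hypothetical names chosen for illustration and are not part of transformers, and the example string is a subset of the config shown above.

```python
import json

# Hypothetical helper (not part of transformers): the config keys that
# relate to quantization, taken from the config.json in this issue.
QUANT_KEYS = ("model_type", "quant_mode", "force_dequant", "torch_dtype")


def quantization_settings(config_text: str) -> dict:
    """Parse a config.json string and return only the quantization-related keys.

    Keys that are absent from the config map to None.
    """
    config = json.loads(config_text)
    return {key: config.get(key) for key in QUANT_KEYS}


# Subset of the config from this issue, used as a stand-in for the real file.
example = """
{
  "model_type": "ibert",
  "quant_mode": true,
  "force_dequant": "none",
  "torch_dtype": "int8"
}
"""

settings = quantization_settings(example)
print(settings)
```

With a real checkpoint, the same function could be given the contents of the config.json file read from disk.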
Can you tell me how you tested the accuracy and inference time of I-BERT in int8? Would you mind sharing your test code?
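For anyone comparing the two models in the meantime, the following is a minimal sketch of a timing harness, not the maintainers' test code. `time_inference` is a hypothetical helper, and the workload here is a plain Python stand-in rather than a real model. It assumes that, on a GPU, `run` would wrap the model's forward pass and `sync` would be something like `torch.cuda.synchronize`, since CUDA kernels launch asynchronously and unsynchronized timings can be misleading.

```python
import statistics
import time
from typing import Callable, Optional


def time_inference(
    run: Callable[[], object],
    warmup: int = 3,
    repeats: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time `run` over several repeats after a few untimed warmup calls.

    `sync` is an optional callable invoked before and after each timed call.
    Assumption: for a PyTorch model on GPU this would be torch.cuda.synchronize.
    """

    def timed_call() -> float:
        if sync is not None:
            sync()  # make sure earlier work has finished before starting the clock
        start = time.perf_counter()
        run()
        if sync is not None:
            sync()  # wait for this call's work to finish before stopping the clock
        return time.perf_counter() - start

    # Warmup calls are not timed; they absorb one-off costs such as caching.
    for _ in range(warmup):
        run()

    samples = [timed_call() for _ in range(repeats)]
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "repeats": repeats,
    }


# Stand-in workload; a real test would call the model's forward pass instead.
result = time_inference(lambda: sum(i * i for i in range(10_000)))
print(result)
```

Running the same harness on both models with identical inputs, batch size, and sequence length would give directly comparable numbers.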