microsoft/TransformerCompression

Is there an inference demo for the sliced model?

zhaoyang-star opened this issue · 0 comments

I noticed that loading a sliced model and then calling model.generate() returns wrong output compared to the dense model. run_benchmark.py gives only limited information about how to run the sliced model, so would it be possible to provide a toy inference demo for the sliced model? Then we could run the dense and sliced models on the same prompt and compare the outputs (something like the sketch below). Thanks.
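To be clearer about what I mean, here is a minimal sketch of the comparison I have in mind. The dense half uses the plain Hugging Face API; the sliced half is left as a placeholder, since how to load the sliced checkpoint is exactly what I cannot work out from the repo (the load_sliced_model helper named below is hypothetical, not an API from this repo):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-125m"  # any model the repo supports
    prompt = "The universe is made of"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    # Dense baseline: standard Hugging Face greedy generation.
    dense = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
    with torch.no_grad():
        dense_out = dense.generate(input_ids, max_new_tokens=32, do_sample=False)
    print("dense :", tokenizer.decode(dense_out[0], skip_special_tokens=True))

    # Sliced model: placeholder -- this loading step is what I'd like a demo for.
    # sliced = load_sliced_model(model_name, "path/to/sliced_checkpoint")  # hypothetical helper
    # with torch.no_grad():
    #     sliced_out = sliced.generate(input_ids, max_new_tokens=32, do_sample=False)
    # print("sliced:", tokenizer.decode(sliced_out[0], skip_special_tokens=True))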

For reference, here is the relevant part of the inference code from gpu_utils.py:

        # Feed the prompt one token per step, timing each forward pass.
        for i in tqdm(range(input_seq_len), desc="Benchmarking"):
            # Current token (shape [batch_size, 1]) and the attention mask
            # covering all tokens seen so far.
            input_ids_i = input_ids[:, i].reshape((batch_size, 1)).to(config.device)
            attention_mask_i = attention_mask[:, : (i + 1)].to(config.device)

            # Synchronize before and after the forward pass so the timing
            # reflects GPU work, not just the kernel launch.
            sync_gpus()
            start_time = time.time()
            output = model_adapter.model(input_ids_i, past_key_values=cache["past"], attention_mask=attention_mask_i)
            sync_gpus()
            time_measurements.append(time.time() - start_time)

            # Carry the KV cache forward so the next step only processes one new token.
            cache["past"] = list(output.past_key_values)
            del output

            input_ids_i, attention_mask_i = input_ids_i.to("cpu"), attention_mask_i.to("cpu")
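As far as I can tell, this loop feeds the model one token per step and carries the KV cache forward in cache["past"], which is the usual incremental-decoding pattern. So a plain greedy-decoding version for comparing dense and sliced outputs might look roughly like this (a sketch against the generic Hugging Face forward API, not the repo's model adapters):

    import torch

    @torch.no_grad()
    def greedy_decode(model, tokenizer, prompt, max_new_tokens=32, device="cuda"):
        # Encode the prompt, then decode one token at a time, reusing the
        # KV cache the same way the benchmark loop above does.
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        generated = input_ids
        next_input = input_ids
        past = None
        for _ in range(max_new_tokens):
            out = model(next_input, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=-1)
            next_input = next_token  # only the new token; the cache covers the rest
        return tokenizer.decode(generated[0], skip_special_tokens=True)

Running this with both the dense and the sliced model on the same prompt should make any divergence easy to see.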