microsoft/TransformerCompression

loading models is painful and not HF compatible

jameshensman opened this issue · 0 comments

To load a sliced model, we first instantiate an uninitialized model, slice it, and then load the checkpoint into it. This is painful for a few reasons:

  1. Adding new models means adding a switch in this code:

    def get_model_and_tokenizer(

  2. It's not easy for HF users to use our models directly without running the slicing themselves. It would be great if users could just call `AutoModelForCausalLM.from_pretrained('microsoft/sliced-llama2-13B-30pc')`. This would mean publishing compatible models on HF, which in turn would mean defining the model class explicitly.
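One way to avoid the per-model switch would be a registry keyed on the HF `model_type`, so that supporting a new architecture is just defining and decorating a class. This is only a sketch; `ModelAdapter`, `register_adapter`, and the registry are hypothetical names, not the repo's current API:

```python
# Sketch of a registry replacing the if/else in get_model_and_tokenizer.
# ModelAdapter, register_adapter, and _ADAPTERS are hypothetical names.
from typing import Dict, Type


class ModelAdapter:
    """Base class for the per-architecture hooks slicing needs."""

    model_type: str = ""

    def load(self, name: str):
        raise NotImplementedError


_ADAPTERS: Dict[str, Type[ModelAdapter]] = {}


def register_adapter(cls: Type[ModelAdapter]) -> Type[ModelAdapter]:
    """Class decorator: adding a new model means only defining + decorating it."""
    _ADAPTERS[cls.model_type] = cls
    return cls


def get_adapter(model_type: str) -> ModelAdapter:
    try:
        return _ADAPTERS[model_type]()
    except KeyError:
        raise ValueError(f"No adapter registered for {model_type!r}")


@register_adapter
class LlamaAdapter(ModelAdapter):
    model_type = "llama"

    def load(self, name: str):
        # placeholder for the real HF loading logic
        return f"loaded {name}"
```

Usage would look like `get_adapter("llama").load("meta-llama/Llama-2-13b-hf")`, and an unknown `model_type` fails with a clear error instead of falling off the end of an if/else chain.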

Things to consider for a solution:

  • we'd probably need to store the "new hidden size" in the config somehow
  • we should make sure this doesn't block us from doing slicing with different levels per layer.
  • adding new models should "just work", without the current if/else in `hf_utils`
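On the first two points, the config could carry a per-layer hidden size alongside the original one, which also keeps the door open for different slicing levels per layer. A minimal round-trip sketch of how that might serialize through a `config.json`-style file; the field names (`sliced_hidden_sizes`) are hypothetical, not an agreed format:

```python
# Sketch: persisting per-layer sliced hidden sizes in a JSON config,
# the way an HF config.json would carry them. Field names are hypothetical.
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SlicedConfig:
    hidden_size: int                  # original model width
    num_hidden_layers: int
    # one entry per layer, so non-uniform slicing is representable
    sliced_hidden_sizes: List[int] = field(default_factory=list)

    def __post_init__(self):
        # default: no slicing, uniform across layers
        if not self.sliced_hidden_sizes:
            self.sliced_hidden_sizes = [self.hidden_size] * self.num_hidden_layers

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "SlicedConfig":
        return cls(**json.loads(s))


# e.g. 30% slicing on every layer of a 13B-ish model; any per-layer
# schedule would serialize the same way
cfg = SlicedConfig(hidden_size=5120, num_hidden_layers=40,
                   sliced_hidden_sizes=[int(5120 * 0.7)] * 40)
restored = SlicedConfig.from_json(cfg.to_json())
```

Because the sliced sizes live in the config, an HF-style `from_pretrained` could build correctly shaped layers before loading the checkpoint, with no separate slicing step.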
