Loading models is painful and not HF compatible

To load a sliced model, we first create an uninitialized model, slice it, and then load the checkpoint into it. This is a pain for a few reasons:

Adding new models means adding a switch in this code:

TransformerCompression/src/slicegpt/hf_utils.py (line 75 in a369325):

def get_model_and_tokenizer(

It's not easy for HF users to use our models directly without running the slicing themselves. It would be great if users could just call AutoModelForCausalLM.from_pretrained('microsoft/sliced-llama2-13B-30pc'). That would mean publishing compatible sliced models on HF, which in turn would mean defining the sliced model class explicitly.

Things to consider for a solution:

We'd probably need to store the "new hidden size" in the config somehow.
We should make sure this doesn't block us from slicing with different levels per layer.
Adding new models should "just work", without the current if/else in hf_utils.
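
To make the three points above concrete, here is a rough sketch of one possible shape for a solution. It is not the existing slicegpt API. Every name in it (SlicedConfig, register_adapter, build_uninitialized_model) is hypothetical, and the "model" it builds is just a dict of layer shapes standing in for a real module. The idea is to store one sliced hidden size per layer in a serializable config, and to replace the if/else with a registry keyed on model type.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class SlicedConfig:
    """Hypothetical config; not part of slicegpt or transformers."""

    model_type: str                # key used to look up the adapter
    hidden_size: int               # original (unsliced) hidden size
    layer_hidden_sizes: List[int]  # sliced hidden size, one entry per layer

    @classmethod
    def uniform(cls, model_type: str, hidden_size: int,
                num_layers: int, sliced_percent: int) -> "SlicedConfig":
        # Same slicing level for every layer; integer math avoids float rounding.
        new_size = hidden_size * (100 - sliced_percent) // 100
        return cls(model_type, hidden_size, [new_size] * num_layers)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "SlicedConfig":
        return cls(**json.loads(text))


# Registry replacing the if/else: model_type -> builder function.
_ADAPTERS: Dict[str, Callable[[SlicedConfig], dict]] = {}


def register_adapter(model_type: str):
    """Decorator that registers a builder for one model type."""
    def decorator(builder):
        _ADAPTERS[model_type] = builder
        return builder
    return decorator


def build_uninitialized_model(config: SlicedConfig) -> dict:
    """Dispatch on config.model_type instead of branching explicitly."""
    try:
        builder = _ADAPTERS[config.model_type]
    except KeyError:
        raise ValueError(f"no adapter registered for {config.model_type!r}") from None
    return builder(config)


@register_adapter("llama")
def _build_llama(config: SlicedConfig) -> dict:
    # Placeholder: a real adapter would construct the sliced module here.
    shapes = [(config.hidden_size, d) for d in config.layer_hidden_sizes]
    return {"model_type": config.model_type, "layer_shapes": shapes}
```

With something like this, supporting a new model would mean adding one decorated builder rather than editing hf_utils, and per-layer slicing levels fall out of layer_hidden_sizes being a list. Making from_pretrained work directly would additionally require a real model class registered with the transformers Auto classes, which this sketch does not attempt.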
