huggingface/transformers

Export to ExecuTorch

guangy10 opened this issue · 7 comments

Feature request

Unlock a new workflow for on-device use-cases via torch.export and ExecuTorch.

So ideally I'd like to get the following workflow working:

  1. Load a model with StaticCache:
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(hf_model_repo)
model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    config=config,
    attn_implementation="sdpa",
    cache_config={
        "use_cache": True, 
        "cache_implementation": "static", 
        "max_cache_length": 128,
    },  # Mandatory field to set ONLY for "Export to ExecuTorch" workflow, optional in other use-cases
)
  2. Then we can export the model with StaticCache without passing the cache-related args to forward().
# This is the `forward()` signature for PreTrainedModel
# 
# def forward(
#     input_ids: torch.LongTensor = None,
#     attention_mask: Optional[torch.Tensor] = None,
#     position_ids: Optional[torch.LongTensor] = None,
#     past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
#     inputs_embeds: Optional[torch.FloatTensor] = None,
#     labels: Optional[torch.LongTensor] = None,
#     use_cache: Optional[bool] = None,
#     output_attentions: Optional[bool] = None,
#     output_hidden_states: Optional[bool] = None,
#     return_dict: Optional[bool] = None,
#     cache_position: Optional[torch.LongTensor] = None,
# ) -> Union[Tuple, CausalLMOutputWithPast]:

# Will NOT require passing `attention_mask`, `past_key_values`, `use_cache`, or other optional cache-related
# fields to `forward()` to get an exported model with `StaticCache` enabled.

exported = torch.export.export(
    model,
    args=(model_inputs,),
    kwargs={"position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
)

or further lower to ExecuTorch with predefined recipes like:

executorch_m = export_to_executorch(
    model, 
    export_args={"input_ids": <val>, "position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
    recipes="xnnpack_q8",  # 8bit quantized and delegate to XNNPACK
)

# The exported model is statically self-contained: whether it uses a cache, the type of cache, the max cache/sequence
# length, etc. are baked into the artifact. There is no need to specify those configs at generation time; they will be
# ignored (with a warning) if specified.

ExecuTorch supports delegation to the XNNPACK backend, Apple Core ML and MPS, Qualcomm QNN, ARM Ethos-U, Vulkan GPU, and more. We can create recipes for the supported backends so that users can use them directly for their targeted use-cases.
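For illustration, here is a minimal sketch of what an "xnnpack"-style recipe could expand to under the hood, using ExecuTorch's lowering APIs (module paths may vary across ExecuTorch versions; the 8-bit quantization implied by "q8" would additionally use pt2e quantization and is omitted here):

import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner


def lower_with_xnnpack(exported_program: torch.export.ExportedProgram) -> bytes:
    # Convert the torch.export graph into an Edge-dialect program.
    edge = to_edge(exported_program)
    # Delegate the supported subgraphs to the XNNPACK backend.
    edge = edge.to_backend(XnnpackPartitioner())
    # Serialize to the ExecuTorch program format (.pte bytes).
    return edge.to_executorch().buffer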

  3. Use the exported/lowered artifact for inference:


# The exported artifact only contains the `Transformer` that predicts a single token; the autoregressive logic
# will live in `generate`:

generate(model=executorch_m, prompt="Hello world")  # Will generate up to the maximal sequence length/cache length 
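A minimal sketch of what such a generate loop could look like around the exported single-token module (greedy decoding only; it assumes the (token, input_pos) signature shown later in this issue, and the tokenizer checkpoint is just a placeholder):

import torch
from transformers import AutoTokenizer


def generate(model, prompt: str, max_cache_len: int = 128) -> str:
    # Greedy decoding around a module that predicts one token per call.
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # placeholder tokenizer
    tokens = tokenizer(prompt, return_tensors="pt").input_ids[0].tolist()
    # Prefill: feed the prompt token by token (a real runtime would batch this step).
    for pos, tok in enumerate(tokens):
        logits = model(torch.tensor([[tok]]), torch.tensor([pos]))
    # Decode: append the argmax token until the static cache is full or EOS is produced.
    for pos in range(len(tokens), max_cache_len):
        next_tok = int(torch.argmax(logits[0, -1]))
        if next_tok == tokenizer.eos_token_id:
            break
        tokens.append(next_tok)
        logits = model(torch.tensor([[next_tok]]), torch.tensor([pos]))
    return tokenizer.decode(tokens)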

The example workflow above shows direct integration between torch.export+ExecuTorch and HF transformers models. Eventually this workflow could be accessible via optimum exporters-et.

Issues Tracker

  - StaticCache
  - Models
  - E2E workflow via optimum

Motivation

Let me explain the motivation from a bigger-picture perspective.

The Ultimate Goal

The goal is to enable a new "Export to ExecuTorch" workflow for edge use-cases, just like the existing ONNX, TFLite, and TorchScript exports, via Optimum.

Why is the option to statically configure cache needed?

  1. torch.export() doesn't support passing a StaticCache instance as a parameter to forward().
    The dynamo tracing will fail with an error like this (a minimal repro sketch follows after this list):
E    torch._dynamo.exc.UserError: It looks like one of the inputs with type `<class 'transformers.cache_utils.StaticCache'>` is not supported or pytree-flattenable.
E    Exported graphs inputs can only contain the following supported types: [<class 'torch.Tensor'>, <class 'torch.SymInt'>, <class 'torch.SymFloat'>, <class 'torch.SymBool'>, <class 'torch.ScriptObject'>, <class 'NoneType'>, <class 'complex'>, <class 'torch.dtype'>, <class 'str'>, <class 'bool'>, <class 'ellipsis'>, <class 'int'>, <class 'torch.layout'>, <class 'code'>, <class 'torch.memory_format'>, <class 'bytes'>, <class 'float'>, <class 'torch.device'>].
  2. Beyond torch.export(), ExecuTorch also requires the model to be statically self-contained, so that the Runtime can just load the serialized binary (.pte) and run it as-is; this means the spec of the StaticCache must be part of the serialized binary.
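A minimal repro sketch of the failure in item 1 above (hf_model_repo stands for any decoder-only checkpoint whose model class supports StaticCache; the StaticCache argument names may differ across transformers versions):

import torch
from transformers import AutoModelForCausalLM, StaticCache

model = AutoModelForCausalLM.from_pretrained(hf_model_repo)
cache = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=128, device="cpu", dtype=model.dtype
)
input_ids = torch.tensor([[1, 2, 3]])
# Fails: the StaticCache instance is not a pytree-flattenable export input.
torch.export.export(
    model, args=(input_ids,), kwargs={"past_key_values": cache, "use_cache": True}
)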

For example, in ExecuTorch we have a C++ runtime for LLMs that can load the exported transformer model for inference. To utilize that runtime, forward() must comply with the same signature, which looks like:

def forward(
    token: torch.Tensor,
    input_pos: Optional[torch.Tensor],
) -> torch.Tensor:

As shown in the prototype PR #31706 and in PR #32168, we can make it compatible with the forward() of the Hugging Face transformers PreTrainedModel by statically instantiating StaticCache in an adapter forward() method. However, it's not scalable to add such an adapter forward() for every model that wants to participate in the "Export to ExecuTorch" workflow.
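A minimal sketch of such an adapter, in the spirit of those PRs (the wrapper name and details are illustrative, not the exact code from #31706/#32168):

import torch
from transformers import PreTrainedModel, StaticCache


class ExportableModuleWithStaticCache(torch.nn.Module):
    """Adapts a PreTrainedModel to the (token, input_pos) signature expected by the runtime."""

    def __init__(self, model: PreTrainedModel, max_cache_len: int = 128):
        super().__init__()
        self.model = model
        # Statically instantiate the cache at construction time so export can capture it.
        self.static_cache = StaticCache(
            config=model.config,
            max_batch_size=1,
            max_cache_len=max_cache_len,
            device=model.device,
            dtype=model.dtype,
        )

    def forward(self, token: torch.Tensor, input_pos: torch.Tensor) -> torch.Tensor:
        outputs = self.model(
            input_ids=token,
            past_key_values=self.static_cache,
            cache_position=input_pos,
            use_cache=True,
        )
        return outputs.logits

torch.export would then target the wrapper, e.g. torch.export.export(wrapper, args=(torch.tensor([[1]]), torch.tensor([0]))), instead of the bare PreTrainedModel.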

How to generalize this in a more scalable way?

If we could make the Cache statically configurable at model construction time, e.g. via AutoConfig, there would be no need to pass optional cache-related args (e.g. attention_mask, past_key_values, use_cache) to the forward() of the Hugging Face transformers PreTrainedModel, either for export or for generate/inference using the exported artifact.

Is it compatible with existing config, e.g. generation config?

Yes. It unlocks a new option mainly for the export use-case and shouldn't conflict with non-export use-cases, where the cache config can still be passed through the generation_config.

Your contribution

  1. Co-design the "Export to ExecuTorch" workflow.
  2. Co-design generate for exported models and the integration in Optimum.

Here is how ExecuTorch implements generate() for Llama 2/3 in eager Python and in C++.

cc: @amyeroberts @gante @ArthurZucker @michaelbenayoun

Thank you for detailing ExecuTorch's goals 🤗

Two follow-up questions:

  1. In the snippet you shared at the top, you explicitly load the model config before loading the model with .from_pretrained. However, .from_pretrained handles loading the config and modification of the base config -- for instance, model = AutoModelForCausalLM.from_pretrained("distilgpt2", use_cache=False) will change use_cache in model.config from the default True to False. Am I correct in saying that we don't need to manually load the config then?
  2. We have been separating the parameterization of everything specifically related to auto-regressive generation into generation_config (i.e. it is not only for generate, just like config is not only for our model classes). As such, we want to place the cache config in generation_config, as KV caching only exists in auto-regressive generation. generation_config is also loaded in .from_pretrained, just like config. However, updating parameters through .from_pretrained is not yet supported (e.g. .from_pretrained(model_repo, cache_config={...}) wouldn't work). If the answer to 1. is yes: would this API [passing the cache config to .from_pretrained] be useful to you?

@gante Thanks for the great follow-up questions:

For #1: yes, if we can pass/override the config while loading the pretrained model, e.g. model = AutoModelForCausalLM.from_pretrained("distilgpt2", use_cache=True, cache_implementation="static", max_seq_length=128, attn_implementation="sdpa", ...), or, even better, consolidate all cache-related configs into one field cache_config (something you and Arthur are suggesting?).

For #2: yes, I understand there are use-cases where keeping the cache config closer to auto-regressive generation is cleaner. In my proposal the KV cache config can still be passed through generation_config for any use case, with no conflict. It only needs to be passed to .from_pretrained(cache_config={"use_cache": True, "cache_implementation": "static", "max_cache_length": 128, ...}) like this for "Export to ExecuTorch". That's because ExecuTorch handles the memory planning ahead-of-time (during export and lowering to ExecuTorch), so the Runtime doesn't deal with dynamic memory allocation, which is where the fast inference comes from. And yes, the API [passing the cache config to .from_pretrained] will be useful for the ExecuTorch use-case!

Hey, I saw your comments on another PR and wanted to share that I was thinking of making the cache config savable/loadable the same way as the generation config. It would hold all the needed args for all cache types, and loading a model via from_pretrained would also load the cache config and assign self.cache_config = cache_config. @gante WDYT?

I'm quite biased towards keeping the cache config inside generation_config:

  1. A separate file to hold a handful of fields seems overkill
  2. Caching exists because of generation; it is not a stand-alone feature (contrary to e.g. quantization, which is not part of any existing configuration file)

But happy to reconsider if there are strong arguments to keep them separate :)

Wait, I just realized that we will save the cache config even if it's inside the generation config, so it will be loadable from the Hub. Okay, that makes sense, thanks!
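For illustration, a minimal sketch of that outcome, assuming cache-related fields live on GenerationConfig (cache_implementation is an existing field; cache_config as a field name is part of the proposal being discussed, not necessarily the final API):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
# Cache-related fields stored on the generation config...
model.generation_config.cache_implementation = "static"
model.generation_config.cache_config = {"max_cache_len": 128}  # assumed field name
# ...are serialized into generation_config.json next to the model weights,
# so loading this directory (or a Hub repo) with from_pretrained restores them.
model.save_pretrained("my-model-with-static-cache")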


@gante, I see two orthogonal things in your and @zucchini-nlp's comments. Let's get more clarity on them:

  1. Enable the ability to pass/override the cache config via PreTrainedModel.from_pretrained()

It would take the cache config to construct the model. This is a new feature needed in order to support torch.export and ExecuTorch. I think we're on the same page on this?

  2. Decide where to load/read the cache config from

This is about whether PreTrainedModel.from_pretrained() will load the cache config from a separate config, the generation config, or another config. I don't have a strong opinion on where it should go. To me, there may be a use case to quantize the cache at some point, and for torch.export and ExecuTorch, the quantization process is independent of generation.

It would take the cache config to construct the model. This is a new feature needed in order to support torch.export and ExecuTorch. I think we're on the same page on this?

Yes :)

To me, there may be a use case to quantize the cache at some point

We do indeed have support for quantized caches! Their quantization configuration is set at initialization time, so it will belong in the cache config as well :) (we can have, e.g., an FP16 model and a quantized cache to support very long generation)
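For reference, a minimal sketch of using a quantized KV cache together with an FP16 model through generate (hf_model_repo is a placeholder for a checkpoint that supports the new Cache API; the cache_config keys follow QuantizedCacheConfig and the quanto backend, and may vary by transformers version):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_model_repo = "..."  # placeholder: a decoder-only checkpoint supporting the Cache API
model = AutoModelForCausalLM.from_pretrained(hf_model_repo, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
inputs = tokenizer("Hello world", return_tensors="pt")

# FP16 model weights + quantized KV cache, to support very long generation.
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},  # assumed keys, per QuantizedCacheConfig
)
print(tokenizer.decode(out[0], skip_special_tokens=True))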