Export to ExecuTorch
guangy10 opened this issue · 7 comments
Feature request
Unlock a new workflow for on-device use-cases via torch.export and ExecuTorch.
So ideally I'd like to get the following workflow working:
- Load a model with StaticCache:

model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    config=config,
    attn_implementation="sdpa",
    cache_config={
        "use_cache": True,
        "cache_implementation": "static",
        "max_cache_length": 128,
    },  # Mandatory field to set ONLY for the "Export to ExecuTorch" workflow, optional in other use-cases
)
- Then we can export the model with StaticCache without passing the cache-related args to forward().
# This is the `forward()` signature for PreTrainedModel
#
# def forward(
#     input_ids: torch.LongTensor = None,
#     attention_mask: Optional[torch.Tensor] = None,
#     position_ids: Optional[torch.LongTensor] = None,
#     past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
#     inputs_embeds: Optional[torch.FloatTensor] = None,
#     labels: Optional[torch.LongTensor] = None,
#     use_cache: Optional[bool] = None,
#     output_attentions: Optional[bool] = None,
#     output_hidden_states: Optional[bool] = None,
#     return_dict: Optional[bool] = None,
#     cache_position: Optional[torch.LongTensor] = None,
# ) -> Union[Tuple, CausalLMOutputWithPast]:
#
# Will NOT require passing `attention_mask`, `past_key_values`, `use_cache`, or other optional cache-related
# fields to `forward()` to get an exported model with `StaticCache` enabled.
exported = torch.export(
    model,
    args=(model_inputs,),
    kwargs={"position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
)
Or further lower it to ExecuTorch with predefined recipes like:
executorch_m = export_to_executorch(
    model,
    export_args={"input_ids": <val>, "position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
    recipes="xnnpack_q8",  # 8-bit quantized and delegated to XNNPACK
)
# The exported model is statically self-contained, i.e. it bakes in whether it uses a cache, the type of
# cache, the max length of the cache and sequence, etc. There is no need to specify those configs at
# generation time; they will be ignored (with a warning) if specified.
ExecuTorch supports delegation to the XNNPACK backend, Apple Core ML and MPS, Qualcomm QNN, Arm Ethos-U, Vulkan GPU, and more. We can create recipes for the supported backends so that users can pick one directly for their targeted use-cases.
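To make the recipe idea concrete, here is a minimal sketch of what a recipe-based export_to_executorch helper could look like. export_to_executorch, the recipe registry, and the "xnnpack_q8" name come from the proposal above and are not an existing API; the ExecuTorch entry points used (torch.export.export, to_edge, to_backend, XnnpackPartitioner) exist today, but their module paths and signatures may shift between releases, and the 8-bit quantization step is omitted for brevity.

```python
# Hypothetical sketch of a recipe-based lowering helper (not an existing API).
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Each recipe maps a name to the partitioner(s) used for delegation; quantizers could be
# registered the same way to cover the "q8" part of the recipe.
RECIPES = {
    "xnnpack_q8": [XnnpackPartitioner()],
}

def export_to_executorch(model, export_args, recipes="xnnpack_q8"):
    # 1. Capture the single-token forward pass with torch.export.
    exported = torch.export.export(model, args=(), kwargs=export_args)
    # 2. Convert to the Edge dialect and delegate to the backend(s) named by the recipe.
    edge = to_edge(exported)
    for partitioner in RECIPES[recipes]:
        edge = edge.to_backend(partitioner)
    # 3. Serialize into a self-contained program the ExecuTorch runtime can load as-is.
    et_program = edge.to_executorch()
    with open("model.pte", "wb") as f:
        f.write(et_program.buffer)
    return et_program
```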
- Use the exported/lowered artifact for inference:

# The exported artifact only contains the `Transformer` that predicts a single token; the autoregressive
# logic lives in `generate`:
generate(model=executorch_m, prompt="Hello world")  # Will generate up to the maximal sequence/cache length
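For illustration, here is a minimal greedy-decoding sketch of such a generate helper. It assumes the exported artifact behaves like the single-token forward described above (token ids and cache position in, logits out), that the KV cache lives inside the exported program, and that max_cache_length matches the value baked in at export time (128 in this example); the tokenizer and the function itself are placeholders, not an existing API.

```python
# Greedy decoding around a single-token forward; the autoregressive loop stays outside the
# exported artifact. `model` is assumed to return logits of shape (batch, seq_len, vocab).
import torch

def generate(model, tokenizer, prompt, max_cache_length=128):
    tokens = tokenizer(prompt, return_tensors="pt").input_ids[0].tolist()

    # Prefill: feed the prompt token by token so the static cache inside the artifact fills up.
    for pos, token_id in enumerate(tokens):
        logits = model(
            input_ids=torch.tensor([[token_id]], dtype=torch.long),
            cache_position=torch.tensor([pos], dtype=torch.long),
        )

    # Decode: greedily pick the next token until the static cache (== max sequence length) is full.
    while len(tokens) < max_cache_length:
        next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
        tokens.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break
        logits = model(
            input_ids=torch.tensor([[next_token]], dtype=torch.long),
            cache_position=torch.tensor([len(tokens) - 1], dtype=torch.long),
        )

    return tokenizer.decode(tokens)
```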
The example workflow above shows direct integration between torch.export + ExecuTorch and HF transformers models. Eventually this workflow could be accessible via optimum exporters-et.
Issues Tracker
StaticCache
- Make StaticCache compatible with torch.export: PR #32168
  - #32500
  - #32503
- Support dynamic length slicing in StaticCache: PR #30862
  - #32504
Models
E2E workflow via optimum
Motivation
Let me explain the motivation in the bigger picture.
The Ultimate Goal
The goal is to enable a new "Export to ExecuTorch" workflow for edge use-cases, just like onnx, tflite, torchscript, etc., via Optimum.
Why is the option to statically configure cache needed?
- torch.export() doesn't support passing a StaticCache instance as a param to forward(). The dynamo tracing will fail with an error like this:
E torch._dynamo.exc.UserError: It looks like one of the inputs with type `<class 'transformers.cache_utils.StaticCache'>` is not supported or pytree-flattenable.
E Exported graphs inputs can only contain the following supported types: [<class 'torch.Tensor'>, <class 'torch.SymInt'>, <class 'torch.SymFloat'>, <class 'torch.SymBool'>, <class 'torch.ScriptObject'>, <class 'NoneType'>, <class 'complex'>, <class 'torch.dtype'>, <class 'str'>, <class 'bool'>, <class 'ellipsis'>, <class 'int'>, <class 'torch.layout'>, <class 'code'>, <class 'torch.memory_format'>, <class 'bytes'>, <class 'float'>, <class 'torch.device'>].
- Not only torch.export(): ExecuTorch also requires the model to be statically self-contained so that the Runtime can just load the serialized binary (.pte) and run it as-is, which means the spec of the StaticCache must be part of the serialized binary.
For example, in ExecuTorch we have a C++ runtime for LLMs that can load the exported transformer model for inference. To utilize that runtime, forward() must comply with the same signature, which looks like:
def forward(
    token: torch.Tensor,
    input_pos: Optional[torch.Tensor],
) -> torch.Tensor:
As shown in the prototype PR #31706 and in PR #32168, we can make it compatible with the forward() of the Hugging Face transformers PreTrainedModel by statically instantiating StaticCache in an adapter forward() method. However, it's not scalable to add such an adapter forward() for every model that wants to participate in the "Export to ExecuTorch" workflow.
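For reference, here is a minimal sketch of the kind of adapter those PRs describe. The wrapper class name is made up for illustration; the StaticCache constructor arguments and the forward kwargs follow recent transformers releases and may differ slightly by version.

```python
# Sketch of the adapter approach: the cache is instantiated statically inside the wrapper,
# so the exported graph only takes a token and its position (matching the runtime signature above).
import torch
from transformers.cache_utils import StaticCache

class ExportableModuleWithStaticCache(torch.nn.Module):  # hypothetical name
    def __init__(self, model, max_cache_len=128):
        super().__init__()
        self.model = model
        self.static_cache = StaticCache(
            config=model.config,
            max_batch_size=1,
            max_cache_len=max_cache_len,
            device=model.device,
            dtype=model.dtype,
        )

    def forward(self, token: torch.Tensor, input_pos: torch.Tensor) -> torch.Tensor:
        outputs = self.model(
            input_ids=token,
            cache_position=input_pos,
            past_key_values=self.static_cache,
            use_cache=True,
        )
        return outputs.logits
```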
How to generate in a more scalable way?
If we could make the Cache statically configurable at model construction time, e.g. via AutoConfig, there would be no need to pass the optional args (e.g. attention_mask, past_key_values, use_cache) to the forward() of the Hugging Face transformers PreTrainedModel, neither for export nor for generate/inference using the exported artifact.
Is it compatible with existing config, e.g. generation config?
Yes, it unlocks a new option mainly for the export use-case and shouldn't conflict with non-export use-cases, where the cache can still be passed through the generation_config.
Your contribution
- Co-design the "Export to ExecuTorch" workflow.
- Co-design the generate for exported models and the integration in Optimum.
Here is how ExecuTorch implements generate() for llama2/3 in eager Python and C++.
Thank you for detailing ExecuTorch's goals 🤗
Two follow-up questions:
- In the snippet you shared at the top, you explicitly load the model config before loading the model with .from_pretrained. However, .from_pretrained handles loading the config and modifying the base config -- for instance, model = AutoModelForCausalLM.from_pretrained("distilgpt2", use_cache=False) will change use_cache in model.config from the default True to False. Am I correct in saying that we don't need to manually load the config then?
- We have been separating the parameterization of everything specifically related to auto-regressive generation into generation_config (i.e. it is not only for generate, just like config is not only for our model classes). As such, we want to place the cache config in generation_config, as KV caching only exists in auto-regressive generation. generation_config is also loaded in .from_pretrained, just like config. However, updating its parameters through .from_pretrained is not yet supported (e.g. .from_pretrained(model_repo, cache_config={...}) wouldn't work). If the answer to 1. is yes: would this API [passing the cache config to .from_pretrained] be useful to you?
@gante Thanks for the great follow-up questions:
For #1, yes, if we can pass/override the config while loading the pretrained model, e.g. model = AutoModelForCausalLM.from_pretrained("distilgpt2", use_cache=True, cache_implementation="static", max_seq_length=128, attn_implementation="sdpa", ...), or even better, consolidate all cache-related configs into one field cache_config (something you and Arthur are suggesting?).
For #2, yes, I understand there are use-cases where keeping the cache config closer to auto-regressive generation is cleaner. In my proposal the KV cache config can still be passed through generation_config for any use case, with no conflict. It's required to be passed to .from_pretrained(cache_config={"use_cache": True, "cache_implementation": "static", "max_cache_length": 128, ...}) like this only for "Export to ExecuTorch". That's because ExecuTorch handles the memory planning ahead-of-time (during export and lowering to ExecuTorch), so the Runtime doesn't deal with dynamic memory allocation -- that's where the fast inference comes from. And yes, the API [passing the cache config to .from_pretrained] will be useful for the ExecuTorch use-case!
Hey, saw your comments from another PR and wanted to share that I was thinking of making the cache config savable/loadable the same way as the generation config. It would hold all the needed args for all cache types, and loading a model with from_pretrained should also load the cache config and assign self.cache_config = cache_config. @gante WDYT about it?
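To illustrate the idea, here is a hedged sketch of what such a standalone cache config could look like; the class, its fields, and the cache_config.json file name are all hypothetical, simply mirroring the save/load pattern of GenerationConfig rather than describing an existing transformers API.

```python
# Hypothetical standalone cache config (not an existing transformers class).
import json
import os
from dataclasses import dataclass, asdict

@dataclass
class CacheConfig:
    use_cache: bool = True
    cache_implementation: str = "static"
    max_cache_length: int = 128

    def save_pretrained(self, save_directory: str):
        # Saved next to config.json / generation_config.json so it can be loaded from the hub.
        with open(os.path.join(save_directory, "cache_config.json"), "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def from_pretrained(cls, directory: str) -> "CacheConfig":
        with open(os.path.join(directory, "cache_config.json")) as f:
            return cls(**json.load(f))
```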
I'm quite biased towards keeping the cache config inside generation_config:
- A separate file to hold a handful of fields seems overkill
- Caching exists because of generation, it is not a stand-alone feature (contrary to e.g. quantization, which is not part of any existing configuration file)
But happy to reconsider if there are strong arguments to keep them separate :)
Wait, I just realized that we will save the cache config even if it's inside the generation config. So it will be loadable from the hub. Okay, that makes sense, thanks!
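As a quick sketch of that point: cache fields stored on the generation config are serialized into generation_config.json and come back on load. cache_implementation is an existing GenerationConfig field; the nested cache_config dict is the shape being discussed here and is an assumption, not a settled API.

```python
# Sketch: cache settings persisted inside generation_config.json (the local directory
# "my_model" is a placeholder). The nested `cache_config` dict is assumed, per the discussion above.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    use_cache=True,
    cache_implementation="static",
    cache_config={"max_cache_length": 128},
)
generation_config.save_pretrained("my_model")              # writes my_model/generation_config.json
reloaded = GenerationConfig.from_pretrained("my_model")    # cache fields come back with it
```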
@gante, I see there are two orthogonal things in your and @zucchini-nlp's comments. Let's get more clarity on them:
- Enable the ability to pass/override the cache config via PreTrainedModel.from_pretrained(). It would take the cache config to construct the model. This is a new feature needed in order to support torch.export and ExecuTorch. I think we're on the same page on this?
- Decide where to load/read the cache config from. This is about whether PreTrainedModel.from_pretrained() will load the cache config from a separate config, the generation config, or some other config. I don't have a strong opinion on where it should go. To me, there may be a use case to quantize the cache at some point, and for torch.export and ExecuTorch the quantization process is independent from generation.
> It would take the cache config to construct the model. This is a new feature needed in order to support torch.export and ExecuTorch. I think we're on the same page on this?

Yes :)
> To me, there may be a use case to quantize the cache at some point

We do indeed have support for quantized caches! Their quantization configuration is set at initialization time, so it will belong in the cache config as well :) (we can have, e.g., an FP16 model and a quantized cache, to support very long generation)
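For context, here is roughly how a quantized cache is configured at generation time today: the quantization settings are passed when the cache is created, while the model weights stay in FP16. This sketch assumes the quanto backend is installed, uses TinyLlama purely as an example checkpoint, and the exact argument names may evolve as the cache-config work in this issue lands.

```python
# Quantized KV cache with an FP16 model: the cache's quantization config is set at
# initialization time, independently of the model weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",                  # quantize the KV cache...
    cache_config={"backend": "quanto", "nbits": 4},    # ...while the weights stay in FP16
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```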