[RfC] Ideas for better Hugging Face Hub integration
Closed this issue · 7 comments
This is a follow-up issue after #1510 which added integration with the Hugging Face Hub.
Hi @SamanehSaadat @mattdangerw, I have adapted the snippet in this PR description + this Colab. So far, uploading a model to the HF Hub can be done like this:
```python
import keras_nlp
from keras_nlp.models import BertClassifier
from keras_nlp.utils.preset_utils import save_to_preset, upload_preset

classifier = BertClassifier.from_preset("bert_base_en_uncased")
save_to_preset(classifier, "bert_base_en_uncased_retrained")
upload_preset(
    "hf://Wauplin/bert_base_en_uncased_retrained",
    "bert_base_en_uncased_retrained",
    allow_incomplete=True,
)
```
I needed to add `allow_incomplete=True` to make it work, but I guess that's because `BertClassifier` is not considered complete by itself? (Anyway, not a big problem for me.)
Here is how it looks on the HF Hub.
From there, I have a few questions to make the integration even better on our side:
1. We would like to tag all repos compatible with KerasNLP as `keras-nlp` so that users can filter them when listing models on the Hub. The best way to do that would be to add a model card (i.e. a README.md file) with some metadata in it, and especially `tags: keras-nlp`.

   i. Are model cards something you are planning to develop further (e.g. auto-generate them) in the Keras/KerasNLP ecosystem? If that's the case, it would be really nice for our integration. At the moment, KerasNLP models on the HF Hub have an empty page.

   ii. Otherwise, would it be possible to create model cards with minimal information at least when uploading the preset to the HF Hub (instead of when saving with `save_pretrained`)?

   iii. Without model cards, we could infer the library type from `metadata.json`, but we would prefer to avoid that, as it would mean implementing and maintaining a custom parser server-side.

2. According to `metadata.json`, the model uses Keras 3. I thought all Keras 3 models would be saved as a single `model_xxx.keras` file with everything included. Am I wrong in assuming this? When should we expect a `.keras` file? I'm asking because we thought we would be able to automatically tag all repos that have at least a `.keras` file as `keras3`. If that's not the case, we will need to rely on model card metadata (cf. 1.).

3. At the moment, the KerasNLP model is auto-tagged as compatible with transformers. This would be nice, but it's not the case as long as the transformers <> KerasNLP compatibility is not implemented. What we can do in the meantime:
   - Set `keras-nlp` as the default library. This will be easy once we know how to tag KerasNLP models (cf. 1. and 2.).
   - Add a code snippet as well to show "how to load *** with KerasNLP". We would need your help on this to generate the correct code snippet for each model.
   - Add a link to the KerasNLP documentation from the model page.
   - Remove the `transformers` tag/library/snippet from the repos for the time being (to do on our side).
I think that's it for now. Sorry about the disorganized questions; I've tried to be exhaustive about what can be done. It is mainly a question of metadata to correctly categorize models. Please let me know what you think.
Thanks! And sorry for the delay here. Thoughts below...
> I needed to add `allow_incomplete=True` to make it work, but I guess that's because `BertClassifier` is not considered complete by itself? (Anyway, not a big problem for me.)
This is what #1547 is hoping to address. The high-level flow will be:

```python
classifier.save_to_preset("./local_dir")
upload_preset("hf://Wauplin/bert_base_en_uncased_retrained", "./local_dir")
```
> We would like to tag all repos compatible with KerasNLP as `keras-nlp` so that users can filter them when listing models on the Hub. The best way to do that would be to add a model card (i.e. a README.md file) with some metadata in it, and especially `tags: keras-nlp`.
Yeah, this is an interesting question, particularly as the model card is not an asset in the Kaggle version. For Kaggle, there is one model card stored in the metadata for a whole family of models.
> i. Are model cards something you are planning to develop further (e.g. auto-generate them) in the Keras/KerasNLP ecosystem? If that's the case, it would be really nice for our integration. At the moment, KerasNLP models on the HF Hub have an empty page.
What does transformers do? Other projects? I am not sure what we should auto-generate, given we don't know what fine-tuning etc. has been done by the user.
> ii. Otherwise, would it be possible to create model cards with minimal information at least when uploading the preset to the HF Hub (instead of when saving with `save_pretrained`)?
Yes, this seems like a good plan. And given that Kaggle does not expect a `README.md` in the assets, doing this specifically for Hugging Face might need to be the plan.
> According to `metadata.json`, the model uses Keras 3. I thought all Keras 3 models would be saved as a single `model_xxx.keras` file with everything included.
A `.keras` file is essentially a big zip of lower-level assets (`.json` files, `.weights.h5` files, other assets). Saving a KerasNLP model as a `.keras` file is fully supported. However, because we want to support a wider range of use cases (loading a tokenizer without downloading 10 GB of weights, uploading just a diff of LoRA weights, etc.), we decided to make our hub format just a directory of lower-level assets.
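As a sanity check of the "big zip" description, any zip-aware tool can open a `.keras` file and list its contents. This sketch builds a toy archive with the kinds of entries mentioned above (the exact file names inside a real `.keras` archive may differ) and inspects it:

```python
import io
import zipfile

# Build a toy archive with the kinds of lower-level assets described above.
# The entry names are illustrative, not the exact layout Keras uses.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("config.json", '{"class_name": "BertClassifier"}')
    zf.writestr("model.weights.h5", b"\x00binary weights\x00")
    zf.writestr("metadata.json", '{"keras_version": "3.0.0"}')

# A real .keras file on disk can be inspected the same way.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # ['config.json', 'model.weights.h5', 'metadata.json']
```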
> At the moment, the KerasNLP model is auto-tagged as compatible with transformers. This would be nice, but it's not the case as long as the transformers <> KerasNLP compatibility is not implemented. What we can do in the meantime:
> - Set `keras-nlp` as the default library. This will be easy once we know how to tag KerasNLP models (cf. 1. and 2.).
Sounds like as a start the minimal model card should be the plan.
> - Add a code snippet as well to show "how to load *** with KerasNLP". We would need your help on this to generate the correct code snippet for each model.
This should be auto-generatable. Where does the auto-generated code need to live? And where should it be displayed?
> - Add a link to the KerasNLP documentation from the model page.
Sounds great! Is there a TODO there on the Keras side?
> - Remove the `transformers` tag/library/snippet from the repos for the time being (to do on our side).
SGTM!
Hi @mattdangerw thanks for the comments and sorry for the delay as well!
Re: model cards.
I think we agree that uploading a basic model card when uploading a KerasNLP model to the Hugging Face Hub is the way to go then? It's not much of a problem if the model card is not complete. For other libraries, it varies a lot depending on the context. For example, `transformers`' Trainer class and `diffusers`' LoRA training script generate very detailed model cards with training information. The mergekit library generates model cards with information about the merge. In other cases, the model card is very sparse, as we don't have much information when generating it (here is a basic example of mine). So for KerasNLP, we could have very basic information:
- a sentence "this model has been uploaded using the KerasNLP library" + links
- model name / architecture?
- task type? (text-generation, text-classification)
- tokenizer info?
- other information that might be relevant and can be inferred from the model itself?
- a sentence like "this model card has been generated automatically and should be completed by the model author" + link to these docs.
In addition to the "free-text" part of the model card, we should add information in the metadata section and, in particular, tag it as `keras-nlp` (+ `keras3`?).
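As a concrete illustration, the metadata section of such a README.md could start like this (a sketch; the tags follow the proposal above, and `library_name` is the Hub's field for associating a repo with a library):

```yaml
---
library_name: keras-nlp
tags:
- keras-nlp
- keras3
---
```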
Would you like to draft a first model card template based on what you think can be automatically added to it and then we'll iterate from it?
Re: how/when to generate the model card?

I suggest that we generate the model card in `upload_preset` based on the `config.json` file, and only if `preset.startswith("hf://")`. Otherwise we would have to pass the model to `upload_preset` to generate the model card from it, but that's not ideal in the described flow, right? (We could also have a `model.generate_model_card()` method, but since it's HF-specific, that's maybe not the best.)
Once we have a model card template, I'd be happy to open a PR to integrate it into the library.
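To make the suggestion concrete, here is a rough sketch of what such a hook inside `upload_preset` could do. The helper name, the card template, and the `class_name` key in config.json are illustrative assumptions, not KerasNLP's actual API:

```python
import json
import os

# Hypothetical card template; the metadata tags follow the earlier proposal.
CARD_TEMPLATE = """---
library_name: keras-nlp
tags:
- keras-nlp
---

This model has been uploaded using the KerasNLP library.
Architecture: {architecture}
"""

def maybe_write_model_card(preset_dir, uri):
    # Only generate a card for Hugging Face uploads, as suggested above.
    if not uri.startswith("hf://"):
        return None
    with open(os.path.join(preset_dir, "config.json")) as f:
        config = json.load(f)
    # "class_name" as the architecture field is an assumption for this sketch.
    card = CARD_TEMPLATE.format(architecture=config.get("class_name", "unknown"))
    card_path = os.path.join(preset_dir, "README.md")
    with open(card_path, "w") as f:
        f.write(card)
    return card_path
```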
Re: Where does the auto-generated code need to live? And be displayed?

The code snippets are generated server-side based on this file. For example, for `diffusers` models, the default snippet is generated like this:

```ts
const diffusers_default = (model: ModelData) => [
  `from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("${model.id}")`,
];
```
In `transformers` it's a bit more complex due to the different architectures. So the easiest way to generate a code snippet would be if there were an `AutoModel`-like class in KerasNLP, i.e. a class/helper method able to load a model from a config.json/weights by guessing which model class to use. This is, for example, how the code snippet works for spaCy: `nlp = spacy.load("${nameWithoutNamespace(model.id)}")` (spaCy takes care of loading the appropriate model).

Do you think that's something doable in KerasNLP? And if not, based on what information should the code snippet be generated?
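In sketch form, such an `AutoModel`-like helper could dispatch on the class name recorded in config.json. Everything below is hypothetical (the registry, the `class_name` key, and the dummy model class); KerasNLP's actual preset loading may work differently:

```python
# Registry mapping (assumed) config.json "class_name" values to model classes.
_REGISTRY = {}

def register(cls):
    """Class decorator that records a model class under its own name."""
    _REGISTRY[cls.__name__] = cls
    return cls

@register
class BertClassifier:
    # Stand-in for a real model class; real loading would also read weights.
    @classmethod
    def from_config(cls, config):
        return cls()

def auto_from_config(config):
    """Pick the right model class from the config, AutoModel-style."""
    class_name = config["class_name"]
    try:
        model_cls = _REGISTRY[class_name]
    except KeyError:
        raise ValueError(f"Unknown model class: {class_name}")
    return model_cls.from_config(config)

model = auto_from_config({"class_name": "BertClassifier"})
print(type(model).__name__)  # BertClassifier
```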
Re: And be displayed?

There will be a `Use in KerasNLP` button on the model page directly, same as for `transformers` models at the moment:

Users can click on it to get the code snippet in a modal (here an example with a `diffusers` model):

If the user clicks on `diffusers`, they are redirected to the diffusers docs. So for KerasNLP we could redirect to https://keras.io/keras_nlp/.
> Re: Sounds great! Is there a TODO there on the Keras side?

Nope! I opened a PR (huggingface/huggingface.js#616) to add `keras-nlp` as an official library on the Hub. Once this is merged, all models tagged with it will have a "pretty name" and links to the docs. We now have 4 of them: https://huggingface.co/models?other=keras_nlp! This is why generating the model card (with metadata) will be important to provide the best UX.
@mattdangerw So to sum up, the remaining tasks to tackle are:
- Provide a basic model card template that could be generated from a config.json file.
- Provide a way to generate code snippets. The simplest way server-side would be to rely only on the model id (which requires implementing some logic in KerasNLP). A more complex way would be to build the logic server-side (from model card or config.json info?).
Let me know if I can help on any of those!
> Provide a basic model card template that could be generated from a config.json file.
@SamanehSaadat will take a look at this.
> Provide a way to generate code snippets.
We actually recently exposed our base classes, in part so people can more easily extend the library out of tree, and in part to support things like this! See some notes here: https://github.com/keras-team/keras-nlp/releases/tag/v0.9.0. The easiest thing we could document for now is:

```python
import keras_nlp

tokenizer = keras_nlp.models.Tokenizer.from_preset("${model.id}")
backbone = keras_nlp.models.Backbone.from_preset("${model.id}")
```

This will work for all user uploads. We will need to extend this down the road. Once we add KerasCV support, we might want to parse the `metadata.json` to check for a `keras_nlp_version` or `keras_cv_version`.
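That version check could be as simple as the following sketch (the key names come from the fields mentioned in the discussion; the function name and return values are illustrative):

```python
def library_for_preset(metadata):
    """Guess which Keras library a preset belongs to from its metadata.json.

    `metadata` is the parsed metadata.json dict. The version keys follow the
    fields mentioned above; the returned tags are illustrative.
    """
    if "keras_nlp_version" in metadata:
        return "keras-nlp"
    if "keras_cv_version" in metadata:
        return "keras-cv"
    return None

print(library_for_preset({"keras_nlp_version": "0.9.0"}))  # keras-nlp
```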
We could also add some code that parses the class in the `config.json` and generates different samples depending on the model architecture. This would be highly useful, but it also depends on how fancy we want to get. This could also go in the `README.md` if we want.
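A rough sketch of such a snippet generator, dispatching on the architecture name (the `class_name` key, the suffix mapping, and the task classes are assumptions for illustration, not the actual KerasNLP config schema):

```python
# Map an (assumed) config.json "class_name" suffix to a task-specific snippet.
SNIPPET_TEMPLATES = {
    "CausalLM": 'keras_nlp.models.CausalLM.from_preset("{model_id}")',
    "Classifier": 'keras_nlp.models.Classifier.from_preset("{model_id}")',
}

def snippet_for(config, model_id):
    class_name = config.get("class_name", "")
    for suffix, template in SNIPPET_TEMPLATES.items():
        if class_name.endswith(suffix):
            return template.format(model_id=model_id)
    # Fall back to the generic Backbone loader, which works for any preset.
    return 'keras_nlp.models.Backbone.from_preset("{model_id}")'.format(
        model_id=model_id
    )

print(snippet_for({"class_name": "GPT2CausalLM"}, "user/my_model"))
# keras_nlp.models.CausalLM.from_preset("user/my_model")
```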
> We actually recently exposed our base classes. In part so people can more easily extend the library out of tree, and in part to support things like this!
Nice! This is exactly what we need to generate simple code snippets on the Hub! For now, let's start with documenting `Tokenizer.from_preset` and `Backbone.from_preset` on the model page. I opened a PR on our side to support it: huggingface/huggingface.js#628.
> We could also add some code that parses the class in the config.json and generates different samples depending on the model architecture. This would be highly useful, but also depends on how fancy we want to get. This could also go in the README.md if we want.
If it can be done in the README, that would be really nice, yes! Once we have more refined "model config to snippet" code on the Python side, we could think about making it more official on the Hub. Looking forward to seeing the model card template :) I think it's the last big piece we need to tightly integrate things!
Thanks to #1578, model cards are now auto-generated. Combined with huggingface/huggingface.js#628, I think we can now consider this issue complete. Here are a few screenshots showcasing the HF Hub integration:
- Landing page for KerasNLP models: https://huggingface.co/samanehs/bert_tiny_en_uncased_classifier (see `KerasNLP` as a tag + the `</> Use in KerasNLP` button).
- Code snippet: https://huggingface.co/samanehs/bert_tiny_en_uncased_classifier?library=true (clicking on `KerasNLP` here redirects to https://github.com/keras-team/keras-nlp).
- And finally, by clicking on the `KerasNLP` tag, we can search all models tagged as such: https://huggingface.co/models?library=keras-nlp
Thanks for the collaboration, and looking forward to expanding its usage! 🤗
These are great! Thank you so much for making these changes @Wauplin!