keras-team/keras-nlp

[RfC] Ideas for better Hugging Face Hub integration

Closed this issue · 7 comments

This is a follow-up issue after #1510 which added integration with the Hugging Face Hub.

Hi @SamanehSaadat @mattdangerw, I have adapted the snippet from this PR description + this colab. So far, uploading a model to the HF Hub can be done like this:

import keras_nlp
from keras_nlp.models import BertClassifier
from keras_nlp.utils.preset_utils import save_to_preset, upload_preset

# Load a pretrained classifier from a built-in preset.
classifier = BertClassifier.from_preset("bert_base_en_uncased")

# Save it locally as a preset directory.
save_to_preset(classifier, "bert_base_en_uncased_retrained")

# Upload the local preset directory to the Hugging Face Hub.
upload_preset("hf://Wauplin/bert_base_en_uncased_retrained", "bert_base_en_uncased_retrained", allow_incomplete=True)

I needed to add allow_incomplete=True to make it work, but I guess that's because BertClassifier is not considered complete by itself? (Anyway, not a big problem for me.)
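For completeness, loading the uploaded preset back should then be possible like this (a sketch; I'm assuming from_preset accepts the same hf:// scheme that upload_preset uses):

from keras_nlp.models import BertClassifier

# Hypothetical round-trip: download the preset from the Hub and rebuild the model.
classifier = BertClassifier.from_preset("hf://Wauplin/bert_base_en_uncased_retrained")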

Here is how it looks on the HF Hub.
From there, I have a few questions to make the integration even better on our side:

  1. We would like to tag all repos compatible with KerasNLP as keras-nlp so that users can filter them when listing models on the Hub. The best way to do that would be to add a model card (i.e. a README.md file) with some metadata in it, especially tags: keras-nlp (see the README sketch after this list).
    1. Are model cards something you are planning to develop further (e.g. auto-generate them) in the Keras/KerasNLP ecosystem? If that's the case, it would be really nice for our integration. At the moment, KerasNLP models on the HF Hub have an empty page.
    2. Otherwise, would it be possible to create model cards with minimal information at least when uploading the preset to the HF Hub? (instead of when saving with save_pretrained).
    3. Without model cards, we could infer the library type from metadata.json, but we would prefer to avoid that, as it would mean implementing and maintaining a custom parser server-side.
  2. According to metadata.json, the model uses Keras 3. I thought all Keras 3 models would be saved as a single model_xxx.keras file with everything included. Am I wrong in assuming this? When should we expect a .keras file? I'm asking because we thought we would be able to automatically tag all repos that have at least one .keras file as keras3. If that's not the case, we will need to rely on model card metadata (cf. 1.).
  3. At the moment, the KerasNLP model is auto-tagged as compatible with transformers. This would be nice, but it isn't accurate as long as transformers <> KerasNLP compatibility is not implemented. What we can do in the meantime:
    1. Set keras-nlp as the default library. This will be easy once we know how to tag KerasNLP models (cf. 1. and 2.).
    2. Add a code snippet as well to show "how to load *** with KerasNLP". We would need your help on this to generate the correct code snippet for each model.
    3. Add a link to KerasNLP documentation from the model page.
    4. Remove the transformers tag/library/snippet from the repos for the time being (to do on our side).
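For concreteness, a minimal README.md for such a repo might look like the sketch below; the library_name field and tag values are suggestions based on the discussion above, not a finalized spec:

---
library_name: keras-nlp
tags:
- keras-nlp
---

This model was uploaded with the KerasNLP library.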

I think that's it for now. Sorry about the disorganized questions; I hope I've been exhaustive about what can be done 😄 It is mainly a question of metadata to correctly categorize models. Please let me know what you think.

Thanks! And sorry for the delay here. Thoughts below...

> I needed to add allow_incomplete=True to make it work, but I guess that's because BertClassifier is not considered complete by itself? (Anyway, not a big problem for me.)

This is what #1547 is hoping to address. The high level flow will be:

# Proposed flow: save to a local preset directory, then upload that directory.
classifier.save_to_preset("./local_dir")
upload_preset("hf://Wauplin/bert_base_en_uncased_retrained", "./local_dir")
> 1. We would like to tag all repos compatible with KerasNLP as keras-nlp so that users can filter them when listing models on the Hub. The best way to do that would be to add a model card (i.e. a README.md file) with some metadata in it, especially tags: keras-nlp.

Yeah, this is an interesting question, particularly as the model card is not an asset in the Kaggle version. For Kaggle, there is one model card stored in the metadata for a whole family of models.

> i. Are model cards something you are planning to develop further (e.g. auto-generate them) in the Keras/KerasNLP ecosystem? If that's the case, it would be really nice for our integration. At the moment, KerasNLP models on the HF Hub have an empty page.

What does transformers do? What about other projects? I am not sure what we should auto-generate, given we don't know what fine-tuning etc. has been done by the user.

> ii. Otherwise, would it be possible to create model cards with minimal information at least when uploading the preset to the HF Hub? (instead of when saving with save_pretrained).

Yes, this seems like a good plan. And given that Kaggle does not expect a README.md in the assets, doing this specifically for Hugging Face might need to be the plan.

> 2. According to metadata.json, the model uses Keras 3. I thought all Keras 3 models would be saved as a single model_xxx.keras file with everything included.

A .keras file is essentially a big zip of lower-level assets (.json files, .weights.h5 files, other assets). Saving a keras-nlp model as a .keras file is fully supported. However, because we want to support a wider range of use cases (load a tokenizer without downloading 10GB of weights, upload just a diff of LoRA weights, etc.), we decided to make our hub format just a directory of lower-level assets.
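To make the difference concrete, here is a sketch of the two save paths (assuming the classifier.save_to_preset method proposed in #1547):

import keras_nlp

classifier = keras_nlp.models.BertClassifier.from_preset("bert_base_en_uncased")

# Single-file format: one .keras archive containing config, weights, and assets.
classifier.save("bert_classifier.keras")

# Hub format: a directory of lower-level assets (config .json files,
# .weights.h5, tokenizer assets) so pieces can be fetched individually.
classifier.save_to_preset("./bert_classifier_preset")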

> 3. At the moment, the KerasNLP model is auto-tagged as compatible with transformers. This would be nice, but it isn't accurate as long as transformers <> KerasNLP compatibility is not implemented. What we can do in the meantime:
> i. Set keras-nlp as the default library. This will be easy once we know how to tag KerasNLP models (cf. 1. and 2.).

Sounds like, as a start, the minimal model card should be the plan.

> ii. Add a code snippet as well to show "how to load *** with KerasNLP". We would need your help on this to generate the correct code snippet for each model.

This should be auto-generatable. Where does the auto-generated code need to live? And where would it be displayed?

> iii. Add a link to KerasNLP documentation from the model page.

Sounds great! Is there a TODO there on the Keras side?

> iv. Remove the transformers tag/library/snippet from the repos for the time being (to do on our side).

SGTM!

Hi @mattdangerw thanks for the comments and sorry for the delay as well!

Re: model cards.

I think we agree that uploading a basic model card when uploading a KerasNLP model to the Hugging Face Hub is the way to go then? It's not much of a problem if the model card is not complete. For other libraries, it varies a lot depending on the context. For example, transformers's Trainer class or diffusers's LoRA training script generates very detailed model cards with training information. The mergekit lib generates model cards with information about the merge. In other cases, the model card is very sparse, as we don't have much information when generating it (here is a basic example of mine). So for KerasNLP, we could have very basic information:

  • a sentence "this model has been uploaded using the KerasNLP library" + links
  • model name / architecture?
  • task type? (text-generation, text-classification)
  • tokenizer info?
  • other information that might be relevant and can be inferred from the model itself?
  • a sentence like "this model card has been generated automatically and should be completed by the model author" + link to these docs.

In addition to the "free-text" part of the model card, we should add information in the metadata section and, in particular, tag it as keras-nlp (+ keras3?).

Would you like to draft a first model card template based on what you think can be automatically added to it, and then we'll iterate from it?

Re: how/when to generate the model card?
I suggest that we generate the model card in upload_preset based on the config.json file, and only if preset.startswith("hf://"). Otherwise, we would have to pass the model to upload_preset to generate the model card from it, but that's not ideal in the described flow, right? (We could also have a model.generate_model_card() method, but since it's HF-specific, that's maybe not the best.) See the sketch below.
Once we have a model card template, I'd be happy to open a PR to integrate it in the lib.
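To sketch the idea (generate_model_card and the config.json "class_name" key are hypothetical, not existing KerasNLP APIs):

import json
import os

def generate_model_card(preset_dir):
    # Build a minimal README.md from config.json; the "class_name" key is an
    # assumption about what the preset config records.
    with open(os.path.join(preset_dir, "config.json")) as f:
        config = json.load(f)
    card = (
        "---\n"
        "library_name: keras-nlp\n"
        "tags:\n"
        "- keras-nlp\n"
        "---\n\n"
        f"This {config.get('class_name', 'KerasNLP')} model was uploaded with the KerasNLP library.\n"
    )
    with open(os.path.join(preset_dir, "README.md"), "w") as f:
        f.write(card)

def upload_preset(uri, preset_dir):
    # Only generate a card for Hugging Face uploads, and don't overwrite a
    # README.md written by the model author.
    if uri.startswith("hf://") and not os.path.exists(os.path.join(preset_dir, "README.md")):
        generate_model_card(preset_dir)
    ...  # existing upload logic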

Re: Where does the auto-generated code need to live? And be displayed?

The code snippets are generated server-side based on this file. For example, for diffusers models, the default snippet is generated like this:

const diffusers_default = (model: ModelData) => [
	`from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("${model.id}")`,
];

In transformers it's a bit more complex due to the different architectures. So the easiest way to generate a code snippet would be if there were an AutoModel-like class in KerasNLP, e.g. a class or helper method that is able to load a model from a config.json/weights by guessing which model class to use. This is, for example, how the code snippet works for Spacy: nlp = spacy.load("${nameWithoutNamespace(model.id)}") (Spacy taking care of loading the appropriate model).

Do you think that's something doable in KerasNLP? And if not, based on what information should the code snippet be generated?
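For what it's worth, on the KerasNLP side such an entry point could look like the sketch below (this anticipates the Backbone/Tokenizer base classes discussed later in this thread, so treat the exact calls as assumptions):

import keras_nlp

# A single base-class entry point that reads the preset's config and
# dispatches to the right architecture, Spacy/AutoModel-style.
backbone = keras_nlp.models.Backbone.from_preset("hf://Wauplin/bert_base_en_uncased_retrained")
tokenizer = keras_nlp.models.Tokenizer.from_preset("hf://Wauplin/bert_base_en_uncased_retrained")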

Re: And be displayed?

There will be a Use in KerasNLP button directly on the model page, same as for transformers models at the moment:
image

Users can click on it to get the code snippet in a modal (here is an example with a diffusers model):
image

If the user clicks on diffusers, they are redirected to the diffusers docs. So for KerasNLP, we could redirect to https://keras.io/keras_nlp/.

> Re: Sounds great! Is there a TODO there on the Keras side?

Nope! I opened a PR (huggingface/huggingface.js#616) to add keras-nlp as an official library on the Hub. Once this is merged, all models tagged with it will have a "pretty name" and links to the docs. We now have 4 of them: https://huggingface.co/models?other=keras_nlp! This is why generating the model card (with metadata) will be important to provide the best UX.

@mattdangerw so to sum up, the remaining tasks to tackle are:

  1. Provide a basic model card template that could be generated from a config.json file.
  2. Provide a way to generate code snippets. The simplest way server-side would be to rely only on the model id (which requires implementing some logic in KerasNLP). A more complex way would be to build the logic server-side (from model card or config.json info?).

Let me know if I can help on any of those!

> Provide a basic model card template that could be generated from a config.json file.

@SamanehSaadat will take a look at this.

> Provide a way to generate code snippets.

We actually recently exposed our base classes, in part so people can more easily extend the library out of tree, and in part to support things like this! See some notes here: https://github.com/keras-team/keras-nlp/releases/tag/v0.9.0. The easiest thing we could document for now is:

import keras_nlp

# ${model.id} is the huggingface.js template placeholder for the repo id.
tokenizer = keras_nlp.models.Tokenizer.from_preset("${model.id}")
backbone = keras_nlp.models.Backbone.from_preset("${model.id}")

This will work for all user uploads. We will need to extend this down the road. Once we add KerasCV support, we might want to parse metadata.json to check for a keras_nlp_version or keras_cv_version.

We could also add some code that parses the class in the config.json and generates different samples depending on the model architecture. This would be highly useful, but also depends on how fancy we want to get. This could also go in the README.md if we want.
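A rough sketch of what that config-driven generation could look like; the "class_name" key and the per-architecture templates are illustrative assumptions, not the actual config.json schema:

import json

# Hypothetical map from model class (as recorded in config.json) to a
# task-specific snippet template.
SNIPPET_BY_CLASS = {
    "BertClassifier": 'classifier = keras_nlp.models.BertClassifier.from_preset("{repo_id}")',
    "GPT2CausalLM": 'causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("{repo_id}")',
}

# Generic fallback that works for any preset via the base classes.
GENERIC_SNIPPET = (
    'tokenizer = keras_nlp.models.Tokenizer.from_preset("{repo_id}")\n'
    'backbone = keras_nlp.models.Backbone.from_preset("{repo_id}")'
)

def snippet_for_preset(config_path, repo_id):
    with open(config_path) as f:
        config = json.load(f)
    template = SNIPPET_BY_CLASS.get(config.get("class_name"), GENERIC_SNIPPET)
    return template.format(repo_id=repo_id)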

> We actually recently exposed our base classes. In part so people can more easily extend the library out of tree, and in part to support things like this!

Nice! This is exactly what we need to generate simple code snippets on the Hub! For now, let's start with documenting Tokenizer.from_preset and Backbone.from_preset on the model page. I opened a PR on our side to support it: huggingface/huggingface.js#628.

> We could also add some code that parses the class in the config.json and generates different samples depending on the model architecture. This would be highly useful, but also depends on how fancy we want to get. This could also go in the README.md if we want.

If it can be done in the README, that would be really nice, yes! Once we have more refined "model config to snippet" code on the Python side, we could think about making it more official on the Hub. Looking forward to seeing the model card template :) I think it's the last big piece we need to tightly integrate things!

Thanks to #1578, model cards are now autogenerated. Combined with huggingface/huggingface.js#628, I think we can now consider this issue complete. Here are a few screenshots showcasing the HF Hub integration:

Landing page for KerasNLP models: https://huggingface.co/samanehs/bert_tiny_en_uncased_classifier
See the KerasNLP tag + the </> Use in KerasNLP button.

image

Code snippet: https://huggingface.co/samanehs/bert_tiny_en_uncased_classifier?library=true
Clicking on KerasNLP here redirects to https://github.com/keras-team/keras-nlp.

image

And finally, by clicking on the KerasNLP tag, we can search all models tagged as such: https://huggingface.co/models?library=keras-nlp

image

Thanks for the collaboration, and looking forward to expanding its usage! 🤗

These are great 🎉 Thank you so much for making these changes, @Wauplin!