How to switch between multiple LLM models in a Streamlit page?
Opened this issue · 3 comments
I want to switch between the llama2-7b-chat and llama3-8b models, but loading both at once costs a lot of memory.
How do I unload one model before loading the second?
from transformers import AutoTokenizer

# model_name = 'meta-llama/Llama-2-7b-chat-hf'
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'
# tokenizer_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

# The tokenizer is needed below to look up the eos token ids
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

llm_model = IpexLLM.from_model_id(
    model_name=model_name,
    tokenizer_name=tokenizer_name,
    context_window=4096,
    max_new_tokens=512,
    load_in_low_bit='asym_int4',
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.1,
        "eos_token_id": [tokenizer.eos_token_id,
                         tokenizer.convert_tokens_to_ids("<|eot_id|>")],
    },
    # messages_to_prompt=messages_to_prompt,
    device_map='xpu',
)
You may clear the model with del llm_model.
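On its own, del only drops the Python reference; the memory is actually reclaimed once the object is garbage-collected. A minimal sketch of the idea, using a hypothetical FakeModel stand-in rather than a real IpexLLM instance:

```python
import gc
import weakref

class FakeModel:
    """Hypothetical stand-in for a loaded LLM (a real app would hold an IpexLLM)."""
    pass

model = FakeModel()
probe = weakref.ref(model)   # lets us observe when the object is freed

del model      # drop the last Python reference
gc.collect()   # collect anything kept alive by reference cycles

print(probe() is None)  # True: the stand-in model has been freed
```

With a real model on an Intel GPU, following del llm_model with gc.collect() (and, if you are on PyTorch's XPU backend, torch.xpu.empty_cache()) helps return the device memory before loading the second model.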
Thanks for the good idea of del llm_model, but I have another question.
The create_model() function is decorated with @st.cache_resource, as in the source code below. In my understanding, create_model() runs only once. After I delete the old model, I'd like to create a new one with create_model(). How do I make it run again?
@st.cache_resource
def create_model(model_name):
    # tokenizer, tokenizer_name and completion_to_prompt come from module scope
    llm_model = IpexLLM.from_model_id(
        model_name=model_name,
        tokenizer_name=tokenizer_name,
        context_window=4096,
        max_new_tokens=512,
        load_in_low_bit='asym_int4',
        completion_to_prompt=completion_to_prompt,
        generate_kwargs={
            "do_sample": True,
            "temperature": 0.1,
            "eos_token_id": [tokenizer.eos_token_id,
                             tokenizer.convert_tokens_to_ids("<|eot_id|>")],
        },
        # messages_to_prompt=messages_to_prompt,
        device_map='xpu',
    )
    return llm_model
You may use st.cache_resource.clear() to evict the cached model so that create_model() runs again and creates a new one, as below:
model = create_model(name1)
del model
st.cache_resource.clear()
model = create_model(name2)
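Why the explicit clear matters can be seen with a plain-Python analogy: functools.lru_cache memoizes results per argument much like st.cache_resource does, so calling the loader with a new model name loads the new model, but the old one stays in the cache and keeps holding memory until the cache is cleared. A small sketch (the returned string stands in for the real model object):

```python
from functools import lru_cache

load_count = 0

@lru_cache(maxsize=None)          # stands in for @st.cache_resource
def create_model(model_name):
    global load_count
    load_count += 1
    return f"model:{model_name}"  # stand-in for IpexLLM.from_model_id(...)

create_model("llama2")
create_model("llama2")            # same argument: served from the cache
assert load_count == 1

# A different argument is a cache miss, so the new model loads...
create_model("llama3")
assert load_count == 2
# ...but the old entry is still cached, still holding its memory:
assert create_model.cache_info().currsize == 2

# Clearing the cache (analogous to st.cache_resource.clear()) evicts the
# old model so it can be garbage-collected; the next call reloads fresh.
create_model.cache_clear()
create_model("llama3")
assert load_count == 3
```

This is why the answer above pairs del model with st.cache_resource.clear(): the del drops your local reference, and the clear drops the cache's reference, so the old model can actually be freed before the new one is created.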