intel-analytics/ipex-llm

How to switch between multiple LLM models in a Streamlit page?


I want to switch between the llama2-7b-chat and llama3-8b models, but it costs a lot of memory if I load both.
How can I clear one model before loading the second one?

# LlamaIndex IpexLLM integration; tokenizer and completion_to_prompt
# are defined earlier in the script
from llama_index.llms.ipex_llm import IpexLLM

#model_name = 'meta-llama/Llama-2-7b-chat-hf'
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

#tokenizer_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

llm_model = IpexLLM.from_model_id(
    model_name=model_name,
    tokenizer_name=tokenizer_name,
    context_window=4096,
    max_new_tokens=512,
    load_in_low_bit='asym_int4',
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={
        "do_sample": True, 'temperature': 0.1,
        "eos_token_id": [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]},
    #messages_to_prompt=messages_to_prompt,
    device_map='xpu',
)

You may clear the model with del llm_model.
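
On XPU, deleting the Python reference may not return the device memory right away. A minimal cleanup sketch, assuming a PyTorch build with Intel XPU support (e.g. with intel_extension_for_pytorch installed):

import gc
import torch

# drop the only reference to the model, then collect garbage and
# release cached XPU memory back to the device
del llm_model
gc.collect()
torch.xpu.empty_cache()

After this, loading the second model should not require memory for both at once.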

Thanks for the good idea of del llm_model, but I have another question.
The create_model() function is decorated with @st.cache_resource, as in the source code below. In my understanding, create_model() only runs once. After I delete the old model, I'd like to create a new one by calling create_model() again. How do I make it rerun?

@st.cache_resource
def create_model(model_name):
    llm_model = IpexLLM.from_model_id(
        model_name=model_name,
        tokenizer_name=tokenizer_name,
        context_window=4096,
        max_new_tokens=512,
        load_in_low_bit='asym_int4',
        completion_to_prompt=completion_to_prompt,
        generate_kwargs={
            "do_sample": True, 'temperature': 0.1,
            "eos_token_id": [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]},
        #messages_to_prompt=messages_to_prompt,
        device_map='xpu',
    )
    return llm_model

You may use st.cache_resource.clear() to clear the cached resource so that create_model() runs again and creates a new model, as below:

model = create_model(name1)

# release the first model and clear the Streamlit cache so the
# decorated create_model() actually runs again
del model
st.cache_resource.clear()

model = create_model(name2)
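
Putting it together, a rough sketch of a model switcher in the Streamlit page (illustrative only; it assumes the create_model() above, keeps the loaded model in st.session_state, and frees XPU memory before loading the new one):

import gc
import torch
import streamlit as st

model_name = st.selectbox(
    "Model",
    ['meta-llama/Llama-2-7b-chat-hf', 'meta-llama/Meta-Llama-3-8B-Instruct'],
)

if st.session_state.get("loaded_model_name") != model_name:
    # a different model was picked: drop the old one, free XPU memory,
    # and clear the cache so the decorated create_model() runs again
    if "llm_model" in st.session_state:
        del st.session_state["llm_model"]
    gc.collect()
    if hasattr(torch, "xpu"):
        torch.xpu.empty_cache()
    st.cache_resource.clear()
    st.session_state["llm_model"] = create_model(model_name)
    st.session_state["loaded_model_name"] = model_name

llm_model = st.session_state["llm_model"]

Keeping the loaded model name in st.session_state avoids reloading the same model on every rerun of the script.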