Not able to profile LLAMA2 on iGFX (windows)
vmadananth opened this issue · 3 comments
I am trying to get some profiling runs with IPEX-LLM on LLAMA2.
I am able to run inference using the instructions provided in https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2
I have tried the PyTorch profiler and VTune. Do you have any guidance on how I can get layer-by-layer performance?
The code inside the profiler context does not seem to get executed.
Here is my code:
import argparse

import torch
from torch.profiler import profile, record_function, ProfilerActivity
from transformers import LlamaTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

# get_prompt and DEFAULT_SYSTEM_PROMPT are defined as in the linked example script.

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Llama2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="TheBloke/Llama-2-7B-Chat-AWQ",
                        help='The huggingface repo id for the Llama2 (e.g. `meta-llama/Llama-2-7b-chat-hf` '
                             'and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load the low-bit model.')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format.
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    load_path = args.load_path
    if load_path:
        # Assign to llama_model (not model) so the code below works on this branch too
        llama_model = AutoModelForCausalLM.load_low_bit(load_path, trust_remote_code=True)
        tokenizer = LlamaTokenizer.from_pretrained(load_path)
    else:
        llama_model = AutoModelForCausalLM.from_pretrained(model_path,
                                                           load_in_4bit=True,
                                                           optimize_model=True,
                                                           trust_remote_code=True,
                                                           use_cache=True,
                                                           cpu_embedding=True)
        tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    save_path = args.save_path
    if save_path:
        llama_model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")

    llama_model = llama_model.half().to('xpu')
    print(llama_model.device)

    # Generate predicted tokens
    with torch.inference_mode():
        # warmup
        torch.xpu.synchronize()
        prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        print("input length is: ", len(input_ids[0]))
        output = llama_model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        torch.xpu.synchronize()
        print(output_str)

        e2e_time = []
        print("hello")
        with profile(activities=[ProfilerActivity.XPU, ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:
            with record_function("llama_model_generate"):  # Record the llama_model.generate function
                output = llama_model.generate(input_ids, do_sample=False, max_new_tokens=32)
                output_str = tokenizer.decode(output[0], skip_special_tokens=True)
                torch.xpu.synchronize()
        print(output_str)

    if prof.events():
        print("Profiling data was collected.")
        print(prof.key_averages().table(sort_by="self_xpu_time_total", row_limit=-1))
        with open("llama7b_int4.log", "w") as fw:
            fw.write(prof.key_averages(group_by_input_shape=True).table(sort_by="self_xpu_time_total"))
    else:
        print("No profiling data was collected.")
Hi @vmadananth ,
The PyTorch XPU profiler is not currently supported on Windows.
You can use VTune to obtain kernel-level profiling data (https://www.intel.com/content/www/us/en/docs/vtune-profiler/get-started-guide/2023/windows-os.html).
However, there is currently no accurate way to obtain layer-by-layer performance. (Reverse module deletion may be one method, but it may not be particularly accurate.)
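For a rough idea of where time goes per layer, one option that does not depend on the XPU profiler is to time each submodule with forward hooks. The sketch below is only an approximation under stated assumptions, not something IPEX-LLM provides: `attach_layer_timers` and `demo` are hypothetical names, the demo runs a tiny CPU model, and on an iGPU you would presumably pass `sync=torch.xpu.synchronize` so the clock waits for queued kernels. Forcing a synchronization around every layer changes the execution pattern, so the numbers are better read as relative costs than as exact latencies.

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn


def attach_layer_timers(model, sync=lambda: None, leaf_only=True):
    """Attach hooks that accumulate wall-clock forward time per module name.

    `sync` is called before each timestamp; the default is a no-op (CPU).
    Returns (totals, handles): totals maps module name -> seconds, and
    handles can be used to remove the hooks afterwards.
    """
    totals = defaultdict(float)  # module name -> accumulated seconds
    starts = {}                  # module name -> start time of the current call
    handles = []

    def make_pre_hook(name):
        def pre_hook(module, args):
            sync()  # finish pending device work before starting the clock
            starts[name] = time.perf_counter()
        return pre_hook

    def make_post_hook(name):
        def post_hook(module, args, output):
            sync()  # wait for this module's work before stopping the clock
            totals[name] += time.perf_counter() - starts.pop(name)
        return post_hook

    for name, module in model.named_modules():
        if name == "":
            continue  # skip the root module itself
        if leaf_only and len(list(module.children())) > 0:
            continue  # time only leaf modules to avoid double counting
        handles.append(module.register_forward_pre_hook(make_pre_hook(name)))
        handles.append(module.register_forward_hook(make_post_hook(name)))
    return totals, handles


def demo():
    """Time a tiny stand-in model on CPU and return the per-layer totals."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    totals, handles = attach_layer_timers(model)
    with torch.inference_mode():
        for _ in range(3):
            model(torch.randn(8, 16))
    for handle in handles:
        handle.remove()  # detach the hooks once measurement is done
    return dict(totals)


if __name__ == "__main__":
    for name, seconds in sorted(demo().items(), key=lambda kv: -kv[1]):
        print(f"{name:>4s}  {seconds * 1e3:.3f} ms")
```

With the model from the script above, the same function would be called as `attach_layer_timers(llama_model, sync=torch.xpu.synchronize)` before `generate`, and the totals read after it returns.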
Thank you. Is there a way to understand graph optimizations done by IPEX-LLM?
Hi @vmadananth ,
I'm not sure what you mean by graph optimization here. My understanding is that graph optimization applies to static graphs such as TF/ONNX, while IPEX-LLM [XPU] is built on torch, which uses a dynamic graph.
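Since the model is an ordinary eager-mode torch module rather than a static graph, one way to see what an optimization pass changed is to compare the module classes before and after it runs. The following is a minimal sketch under assumptions: `module_type_counts`, `diff_counts` and `demo` are hypothetical helper names, and the demo swaps a `ReLU` for a `GELU` purely as a stand-in for whatever layer replacement a library might perform. Which classes IPEX-LLM actually substitutes is not stated in this thread.

```python
from collections import Counter

import torch.nn as nn


def module_type_counts(model):
    """Count how many submodules of each class an eager-mode model contains."""
    return Counter(type(m).__name__ for m in model.modules())


def diff_counts(before, after):
    """Return {class name: change in count} for classes whose count changed."""
    names = set(before) | set(after)
    return {n: after[n] - before[n] for n in names if after[n] != before[n]}


def demo():
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
    before = module_type_counts(model)
    # Stand-in for an optimization pass that replaces a layer in place.
    model[1] = nn.GELU()
    after = module_type_counts(model)
    return diff_counts(before, after)


if __name__ == "__main__":
    print(demo())
```

Applied to a real model, `before` would be taken from the model loaded with plain `transformers` and `after` from the one returned by the optimized `from_pretrained` call; `print(model)` gives the same information in a more verbose form.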