intel-analytics/ipex-llm

Not able to profile LLAMA2 on iGFX (Windows)

vmadananth opened this issue · 3 comments

I am trying to get some profiling runs with IPEX-LLM on LLAMA2.
I am able to run inference using the instructions provided in https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2
I have tried the PyTorch profiler and VTune. Do you have any guidance on how I can get layer-by-layer performance?
The profiler context snippet does not get executed.
Here is my code:

import argparse
import time

import torch
import intel_extension_for_pytorch as ipex  # makes the XPU device available to PyTorch
from transformers import LlamaTokenizer
from torch.profiler import profile, record_function, ProfilerActivity

from ipex_llm.transformers import AutoModelForCausalLM

# get_prompt and DEFAULT_SYSTEM_PROMPT come from the linked Llama2 example
# (definitions omitted here).

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using generate() API for Llama2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="TheBloke/Llama-2-7B-Chat-AWQ",
                        help='The huggingface repo id for the Llama2 model (e.g. meta-llama/Llama-2-7b-chat-hf and meta-llama/Llama-2-13b-chat-hf) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load the low-bit model.')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4 bit, which converts the relevant layers into INT4 format.
    # When running LLMs on Intel iGPUs on Windows, setting `cpu_embedding=True` in
    # from_pretrained is recommended so that the memory-intensive embedding layer
    # runs on the CPU instead of the iGPU.
    load_path = args.load_path
    if load_path:
        llama_model = AutoModelForCausalLM.load_low_bit(load_path, trust_remote_code=True)
        tokenizer = LlamaTokenizer.from_pretrained(load_path)
    else:
        llama_model = AutoModelForCausalLM.from_pretrained(model_path,
                                                           load_in_4bit=True,
                                                           optimize_model=True,
                                                           trust_remote_code=True,
                                                           use_cache=True,
                                                           cpu_embedding=True)
        tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    save_path = args.save_path
    if save_path:
        llama_model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")

    llama_model = llama_model.half().to('xpu')

    print(llama_model.device)

    # Generate predicted tokens
    with torch.inference_mode():
        # warmup run
        torch.xpu.synchronize()
        prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        print("input length is: ", len(input_ids[0]))
        output = llama_model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        torch.xpu.synchronize()
        print(output_str)
        e2e_time = []

        print("hello")
        # profiled run
        with profile(activities=[ProfilerActivity.XPU, ProfilerActivity.CPU],
                     profile_memory=True, record_shapes=True) as prof:
            with record_function("llama_model_generate"):  # record the llama_model.generate call
                output = llama_model.generate(input_ids, do_sample=False, max_new_tokens=32)
                output_str = tokenizer.decode(output[0], skip_special_tokens=True)
                torch.xpu.synchronize()
                print(output_str)
        if prof.events():
            print("Profiling data was collected.")
            print(prof.key_averages().table(sort_by="self_xpu_time_total", row_limit=-1))
            with open("llama7b_int4.log", "w") as fw:
                fw.write(prof.key_averages(group_by_input_shape=True).table(sort_by="self_xpu_time_total"))
        else:
            print("No profiling data was collected.")

Hi @vmadananth,
The PyTorch XPU profiler is not supported on Windows at the moment.
You can use VTune to obtain some kernel-level profiling (https://www.intel.com/content/www/us/en/docs/vtune-profiler/get-started-guide/2023/windows-os.html).
But there is currently no accurate way to obtain layer-by-layer performance. (One possible method is to remove modules one by one and compare the timings, but it may not be particularly accurate.)
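If a rough per-layer breakdown is enough, one workaround in eager mode is to time each leaf module with forward hooks. Below is a minimal sketch (not an IPEX-LLM utility, just plain PyTorch); it assumes torch.xpu.synchronize() is available, and the synchronization inside the hooks adds overhead, so treat the numbers as approximate:

import time
from collections import defaultdict

import torch

def attach_layer_timers(model):
    # Accumulate wall-clock time per leaf module name.
    timings = defaultdict(float)
    handles = []

    def make_pre_hook(name):
        def pre_hook(module, args):
            torch.xpu.synchronize()
            module._t0 = time.perf_counter()
        return pre_hook

    def make_post_hook(name):
        def post_hook(module, args, output):
            torch.xpu.synchronize()
            timings[name] += time.perf_counter() - module._t0
        return post_hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_pre_hook(make_pre_hook(name)))
            handles.append(module.register_forward_hook(make_post_hook(name)))
    return timings, handles

# Usage (after the warmup run in the script above):
# timings, handles = attach_layer_timers(llama_model)
# llama_model.generate(input_ids, do_sample=False, max_new_tokens=32)
# for h in handles:
#     h.remove()
# for name, t in sorted(timings.items(), key=lambda kv: -kv[1])[:20]:
#     print(f"{name}: {t * 1000:.2f} ms")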

Thank you. Is there a way to understand the graph optimizations done by IPEX-LLM?

Hi @vmadananth,
I'm not sure what you mean by graph optimization here. My understanding is that graph optimization applies to static graphs such as TF/ONNX, while IPEX-LLM [XPU] is developed on top of PyTorch and runs as a dynamic graph.
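If the goal is just to see what IPEX-LLM changed in the model, one rough option (plain torch introspection, not an IPEX-LLM API) is to list the leaf modules of the optimized model and check which classes replaced the stock torch.nn layers after loading with load_in_4bit=True:

def summarize_leaf_modules(model, limit=30):
    # Print the class of each leaf module; low-bit replacements show up here
    # with IPEX-LLM-specific class names instead of the usual torch.nn ones.
    shown = 0
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            print(f"{name}: {type(module).__module__}.{type(module).__name__}")
            shown += 1
            if shown >= limit:
                break

# summarize_leaf_modules(llama_model)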