ggerganov/llama.cpp

Metal (iOS): Compute function exceeds available temporary registers

guinmoon opened this issue · 5 comments

llama.cpp b2864
iPhone 12 Pro Max

If the H256 flash-attention kernel is registered with

GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_FLASH_ATTN_EXT_F16_H256, flash_attn_ext_f16_h256, ctx->support_simdgroup_mm);

I get:

llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /var/mobile/Containers/Data/Application/1C5A0067-4072-44E5-BF9C-3294A335FAC2/Documents/models/Phi-3-mini-128k-instruct.IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = phi3
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 131072
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 25
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32064
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 96
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32064]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:  225 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = IQ4_NL - 4.5 bpw
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  1536.00 MiB, ( 1536.06 /  4096.02)
ggml_backend_metal_log_allocated_size: allocated buffer, size =   562.91 MiB, ( 2098.97 /  4096.02)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    52.84 MiB
llm_load_tensors:      Metal buffer size =  2021.82 MiB
llama_new_context_with_model: n_ctx      = 1536
llama_new_context_with_model: n_batch    = 1536
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A14 GPU
ggml_metal_init: loading '/var/containers/Bundle/Application/53A850DA-E8BE-4131-A8D3-485E31767545/LLMFarm.app/llmfarm_core_llmfarm_core_cpp.bundle/default.metallib'
ggml_metal_init: GPU name:   Apple A14 GPU
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  4294.98 MB
ggml_metal_init: error: load pipeline error: Error Domain=AGXMetalA14 Code=3 "Compute function exceeds available temporary registers" UserInfo={NSLocalizedDescription=Compute function exceeds available temporary registers}
llama_new_context_with_model: failed to initialize Metal backend

If it is registered with

GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_FLASH_ATTN_EXT_F16_H256, flash_attn_ext_f16_h256, false);

everything works fine.
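
For reference, the third argument to GGML_METAL_ADD_KERNEL controls whether the pipeline is built at all. A rough sketch of what the registration boils down to (paraphrased, not the exact llama.cpp macro body; add_kernel is just an illustrative helper): when the flag is false, newComputePipelineStateWithFunction:error: is never called, so the shader is never specialized for the device and the register error cannot fire.

#import <Metal/Metal.h>

// Paraphrased sketch, not the exact llama.cpp macro body.
static id<MTLComputePipelineState> add_kernel(id<MTLDevice> device,
                                              id<MTLLibrary> library,
                                              NSString *name,
                                              BOOL supported) {
    if (!supported) {
        return nil; // kernel simply not registered -- no compilation happens
    }
    NSError *error = nil;
    id<MTLFunction> fn = [library newFunctionWithName:name];
    id<MTLComputePipelineState> pso =
        [device newComputePipelineStateWithFunction:fn error:&error];
    if (pso == nil) {
        // On the A14 GPU this is where "Compute function exceeds available
        // temporary registers" is reported.
        NSLog(@"load pipeline error: %@", error);
    }
    return pso;
}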

Yes, for head size = 256 the Metal kernels are very slow. I suspected it had something to do with running out of registers, and this error now confirms it. Btw, how do you make the error show up? It never does on my Mac.
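
Even where compilation succeeds, the register pressure should be observable indirectly: Metal lowers a pipeline's maxTotalThreadsPerThreadgroup as per-thread register usage grows. Something like this hypothetical probe (not part of llama.cpp; the kernel name assumes the backend's kernel_ prefix convention) could compare the h256 variant against h64/h128:

#import <Metal/Metal.h>

// Hypothetical probe -- compare h256 against the h64/h128 variants; a much
// lower value for h256 would indicate the compiler is starved for registers.
static void probe_kernel(id<MTLDevice> device, id<MTLLibrary> library, NSString *name) {
    NSError *error = nil;
    id<MTLFunction> fn = [library newFunctionWithName:name];
    id<MTLComputePipelineState> pso =
        [device newComputePipelineStateWithFunction:fn error:&error];
    if (pso != nil) {
        NSLog(@"%@: maxTotalThreadsPerThreadgroup = %lu",
              name, (unsigned long)pso.maxTotalThreadsPerThreadgroup);
    } else {
        // On A14-class iPhone GPUs creation itself fails for h256.
        NSLog(@"%@: pipeline creation failed: %@", name, error);
    }
}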

I have an Intel Mac, so I can't check there. I get the error when running on an iPhone 12 Pro, and it does not depend on the model, because the error occurs at the resource-allocation stage.

I encounter the same issue when trying to deploy llama.cpp on an iPhone 14. It can be bypassed by commenting out all flash_attn-related kernels in ggml-metal.metal, as suggested in this issue report.

I'd also like to point out that this issue is hard to catch on CI: the code compiles fine and only throws an error at runtime. Running on the simulator does not help either, since Apple explicitly states that the simulator does not match real hardware in GPU capabilities; in fact, running llama.cpp on the iPhone simulator throws a different error, "more than 14 constant buffer is not supported". It seems the only way to expose this bug is to run on a real iPhone.
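
One way to surface this in CI would be a small on-device test that tries to build a pipeline for every function in the metallib, run on a physical device in a device lab. A hypothetical sketch (not in the llama.cpp tree; it assumes a compute-only library like ggml's default.metallib):

#import <Metal/Metal.h>
#import <XCTest/XCTest.h>

@interface MetalPipelineTests : XCTestCase
@end

@implementation MetalPipelineTests
// Hypothetical on-device test: actually building every pipeline is what
// triggers device-specific failures such as register exhaustion, so a run
// on real hardware catches what compilation and the simulator both miss.
- (void)testAllPipelinesCompile {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> library = [device newDefaultLibrary];
    for (NSString *name in library.functionNames) {
        NSError *error = nil;
        id<MTLFunction> fn = [library newFunctionWithName:name];
        id<MTLComputePipelineState> pso =
            [device newComputePipelineStateWithFunction:fn error:&error];
        XCTAssertNotNil(pso, @"%@ failed: %@", name, error);
    }
}
@end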

Btw, how do you make the error show up? It never does on my Mac.

The issue does not exist on my M2 Mac mini either; it's specific to the iPhone, not the Mac.

I've disabled the HS=256 kernel in the build.

I confirm that the newest master (commit 0e8d8bfd6caf1d0a8cbdf9d3d5c06fbbb9dfced8) works on my iPhone 14; it no longer throws "Compute function exceeds available temporary registers". Thanks for the fix.