icppWorld/icpp_llm

Increase the length of stories for the tinyStories15M.bin model deployed to the IC


When you ask for a story longer than 20 steps, the canister call will fail with this error message:

```
dfx canister call --network ic llama2 inference '(record {prompt = "" : text; steps = 30 : nat64; temperature = 0.9 : float32; topp = 0.9 : float32;})'
Error: Failed query call.
Caused by: Failed query call.
  The replica returned a replica error: Replica Error: reject code CanisterError, reject message IC0522: Canister 4c4bn-daaaa-aaaag-abvcq-cai exceeded the instruction limit for single message execution., error code Some("IC0522")
```
• When calling without a prompt, you can get up to 20 steps (a reference call is shown below).
• When calling with a prompt, the limit is much lower; prompt processing appears to consume a large share of the instruction budget.
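
For reference, here is a sketch of a call that stays within the limit, using the same record as above but with the 20-step budget reported in the first bullet:

```bash
# Reference call that stays just within the per-message instruction limit
# (no prompt, 20 steps, per the observation above):
dfx canister call --network ic llama2 inference \
  '(record {prompt = "" : text; steps = 20 : nat64; temperature = 0.9 : float32; topp = 0.9 : float32;})'
```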

The purpose of this feature request is to increase the length of the stories that can be generated by the canister.

You should probably switch from a query call to an update call for inference. Update calls have a larger instruction limit thanks to Deterministic Time Slicing, and you can use self-calls to start a new message execution.

See also https://internetcomputer.org/docs/current/developer-docs/backend/resource-limits
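
As a quick experiment along these lines (not from the thread): dfx can force a single call into replicated execution, assuming a dfx version that supports the `--update` flag. A method annotated as a query in the .did file can still be invoked as an update, which should run it under the larger update-call limits:

```bash
# Force replicated (update) execution of the same method, so it runs under
# the update-call instruction limits instead of the query limit:
dfx canister call --update --network ic llama2 inference \
  '(record {prompt = "" : text; steps = 30 : nat64; temperature = 0.9 : float32; topp = 0.9 : float32;})'
```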

@domwoe,
Thank you for that guidance. I am making really good progress by using a sequence of update calls from the client and saving the so-called RunState of the LLM in Orthogonal Persistence, to be reused across calls. OP is so great!

Do you have some more details about the concept of self calls?

Using the self-call approach is actually not needed and might not even be the best option. It works really nicely when the responsibility for issuing the sequence of update calls is given to the client: after each update call completes, the client can stream the partial output back to the user (see the sketch below).
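
A minimal sketch of that client-side loop, assuming the canister resumes generation from the RunState saved in Orthogonal Persistence on each call (method name and record fields are taken from the call shown earlier; the per-call budget of 10 steps is illustrative):

```bash
# Each iteration is one update call that generates a bounded number of steps;
# the canister keeps the RunState across calls, so the story keeps growing.
# The client can print the partial result after each call completes,
# giving the streaming-style output described above.
for i in $(seq 1 5); do
  dfx canister call --network ic llama2 inference \
    '(record {prompt = "" : text; steps = 10 : nat64; temperature = 0.9 : float32; topp = 0.9 : float32;})'
done
```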

You can see how it works in this forum post

Implemented as part of #8

Awesome! Yes, client-side orchestration works as well. It might be a bit slower (due to the extra round trips) and costlier (ingress messages are more expensive than inter-canister calls), though.

Great points! Thanks for that explanation.

In light of those benefits, I will look into options to also support on-chain orchestration. I will open a separate issue for that.