fermyon/spin

LLM WIT World Updates

benbrandt opened this issue

Hi everyone. Looking at the LLM WIT World in Spin, there are some extra properties and parameters that would be great to see in it. Overall, I think what you have is a great start. I don't have much insight into the underlying inference stack to know whether the following could be supported, but they are usually standard offerings, so I hope supporting them doesn't require too much effort and that they enable some new use cases and control flows in AI apps.

I am happy to discuss any of these further. There are some other things that would be interesting to add, but those could land in "minor" releases, so I figured I would start with the ones that would require breaking changes.

Stop Sequences and Optional Max Tokens

For a lot of use cases, there is a structured pattern the model should generate, such as chat messages. In these cases, it is useful to be able to stop token generation once specific tokens are reached, because it means the model has generated the expected next output. For example, in a chat, you can stop once the model starts generating the special tokens for the start of the next message.

Because it can be difficult to gauge how many tokens would be necessary to generate the next expected output, it can be helpful to not pass in max-tokens and let the model generate as many tokens as necessary until a stop sequence is reached. This assumes the backing host implementation allows for this, and defaults to the maximum tokens the model can generate, or the remaining context window, whichever is smaller.

If the backing implementation still requires max tokens, the host implementation can either calculate a value for the user, or the parameter would still need to be required. Either way, by setting stop sequences the user can still stop the model from generating tokens once it reaches a specific sequence, which can save on inference costs and time.
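To make that fallback concrete, here is a minimal Rust sketch of how a host might pick an effective token budget when the guest omits max-tokens. All of the names here are made up for illustration; nothing below is an existing Spin API.

// Hypothetical host-side helper: choose a generation budget when the guest
// did not pass max-tokens. Names are illustrative, not part of Spin today.
fn effective_max_tokens(
    requested: Option<u32>,    // the guest's max-tokens, if provided
    model_max_generation: u32, // most tokens the model can generate in one call
    context_window: u32,       // total context size of the model
    prompt_tokens: u32,        // tokens already consumed by the prompt
) -> u32 {
    let remaining_context = context_window.saturating_sub(prompt_tokens);
    let default_budget = model_max_generation.min(remaining_context);
    // Honor an explicit request, but never exceed what the model can do.
    requested.map_or(default_budget, |r| r.min(default_budget))
}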

/// Inference request parameters
record inferencing-params {
	/// The maximum tokens that should be inferred.
	///
	/// Note: the backing implementation may return fewer tokens.
	max-tokens: option<u32>,
	/// The amount the model should avoid repeating tokens.
	repeat-penalty: float32,
	/// The number of tokens the model should apply the repeat penalty to.
	repeat-penalty-last-n-token-count: u32,
	/// A list of sequences that, if encountered, will cause the API to stop generating further tokens.
	stop: list<string>,
	/// The randomness with which the next token is selected.
	temperature: float32,
	/// The number of possible next tokens the model will choose from.
	top-k: u32,
	/// The probability total of next tokens the model will choose from.
	top-p: float32
}
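For example, a chat-style guest might fill these params in roughly like the following Rust sketch. The struct is just a hand-written mirror of the proposed record, not an existing SDK type, and the "<|user|>" stop token is only an example.

// Hand-written mirror of the proposed `inferencing-params` record, for illustration only.
struct InferencingParams {
    max_tokens: Option<u32>, // None: generate until a stop sequence (or the model's own limit)
    repeat_penalty: f32,
    repeat_penalty_last_n_token_count: u32,
    stop: Vec<String>,       // sequences that end generation early
    temperature: f32,
    top_k: u32,
    top_p: f32,
}

fn chat_turn_params() -> InferencingParams {
    InferencingParams {
        max_tokens: None,                   // rely on the stop sequence instead
        repeat_penalty: 1.1,
        repeat_penalty_last_n_token_count: 64,
        stop: vec!["<|user|>".to_string()], // start of the next chat message
        temperature: 0.8,
        top_k: 40,
        top_p: 0.9,
    }
}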

Finish Reason

Once you allow for two ways to stop the model from generating tokens, it can be useful to know why the model stopped. Did it hit an expected end of sequence (either an end-of-text token or a stop sequence), or did it stop because it ran out of tokens? If it ran out of tokens, perhaps the prompt was too long and needs to be condensed, or the model was generating output that doesn't tokenize as "densely" as expected. Either way, this information is useful for debugging, and potentially for making other decisions about the control flow of the use case.

content-filter is a reason that shows up in OpenAI's API, and could be added for "completeness", but it could also be ignored for now, since I don't know whether you will be applying content filters to the model output in your implementation.

/// The reason the model finished generating
enum finish-reason {
	/// The model hit a natural stopping point or a provided stop sequence
	stop,
	/// The maximum number of tokens specified in the request was reached
	length,
	/// Content was omitted due to a flag from content filters
	content-filter,
}

/// An inferencing result
record inferencing-result {
	/// The reason the model finished generating
	finish-reason: finish-reason,
	/// The text generated by the model
	// TODO: this should be a stream
	text: string,
	/// Usage information about the inferencing request
	usage: inferencing-usage
}
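To show the kind of control flow this enables, here is a small Rust sketch that branches on the finish reason. The types are hand-written stand-ins for the proposed WIT, not existing SDK types.

// Hand-written stand-ins for the proposed `finish-reason` enum and result record.
enum FinishReason {
    Stop,
    Length,
    ContentFilter,
}

struct InferencingResult {
    finish_reason: FinishReason,
    text: String,
}

fn next_message(result: InferencingResult) -> Result<String, String> {
    match result.finish_reason {
        // End-of-text or one of our stop sequences: the expected output is complete.
        FinishReason::Stop => Ok(result.text),
        // Ran out of tokens: the prompt may need condensing, or the output
        // tokenized less densely than expected; retry or surface the partial text.
        FinishReason::Length => Err(format!("truncated output: {}", result.text)),
        // Only relevant if the host applies content filters at all.
        FinishReason::ContentFilter => Err("output was filtered".to_string()),
    }
}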

Required Inferencing Params

Update: I found the default params and I take this back; defaulting is fine. Allowing max-tokens to be optional would still be interesting, though, for the reasons mentioned above.

This is the recommendation most up for debate:

Once both max-tokens and stop sequences are allowed, it is very likely that one, if not both, should be used. Only in rare cases would you want to let the model generate as many tokens as it likes without also providing a stop sequence. Given that, it could be argued that you should have to supply one stopping condition or the other, and then having params be an option makes less sense. But I assume you have some sort of default max-tokens that you might want to keep, in which case letting params stay optional is still a reasonable idea.

On the SDK side you can make this nicer with builders and helpers for constructing all of this (see the sketch after the signature below). I just wanted to flag that, from the WIT side, one of these two parameters should probably be required, assuming you allow the model to be more liberal in how many tokens it generates before it hits a stop sequence. I just wouldn't offer that as the "default" way of running inference, but would allow this mode to be opted into if desired.

infer: func(model: inferencing-model, prompt: string, params: inferencing-params) -> result<inferencing-result, error>;
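As a sketch of what those SDK-side helpers could look like, here is a hypothetical Rust builder that enforces "at least one stopping condition" at build time. The names (ParamsBuilder, stop_sequence, and so on) are invented for this example and only the two relevant fields are shown.

// Hypothetical builder enforcing "max-tokens or at least one stop sequence".
#[derive(Default)]
struct ParamsBuilder {
    max_tokens: Option<u32>,
    stop: Vec<String>,
}

impl ParamsBuilder {
    fn max_tokens(mut self, n: u32) -> Self {
        self.max_tokens = Some(n);
        self
    }

    fn stop_sequence(mut self, sequence: &str) -> Self {
        self.stop.push(sequence.to_string());
        self
    }

    // Refuse to build a request that has no stopping condition at all.
    fn build(self) -> Result<(Option<u32>, Vec<String>), &'static str> {
        if self.max_tokens.is_none() && self.stop.is_empty() {
            return Err("set max-tokens or at least one stop sequence");
        }
        Ok((self.max_tokens, self.stop))
    }
}

// Usage: ParamsBuilder::default().stop_sequence("<|user|>").build()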