cognitivecomputations/OpenChatML

Wolfram Ravenwolf's comments on OpenChatML

Opened this issue · 4 comments

Hi Eric,

excellent idea! I've always been in favor of a standardized, future-proof prompt format that is both simple and unambiguous. Up until now, I was recommending ChatML (for lack of a better template), but now that there's OpenChatML, I hope we can improve and standardize this. I have read the specification and here are my comments:

Tokens

  • BOS/EOS Tokens: The use of <s> and </s> as BOS/EOS tokens is ambiguous due to their common use as HTML strikethrough tags. This could lead to confusion when processing HTML documents.

    BOS and EOS tokens are usually special in that the BOS token is added automatically by the inference software rather than sent as part of the prompt, and the EOS token should never appear in the input prompt; it is only output by the model itself to signal the end of generation. However, that changed with the introduction of the <|im_end|> token, which has now largely replaced the EOS token and appears repeatedly in the prompt. I don't think we need both sets of special tokens and consider the old-style BOS/EOS tokens redundant. The model should emit its EOS token to signal the end of generation, and <|im_end|> is better than the ambiguous </s>, so I'd scrap the latter and get rid of <s> as well.

    Update: After reading your EOS comment on X and thinking about it some more, you're right: <|im_end|> signaling the end of a role's message wouldn't be enough if we want one AI generation to include multiple roles (e.g. in a multi-character chat). If we want to fully support that, we do need two different tokens, one to end a character's response and another to end generation. So yes, I now agree with a </s>, though I'd prefer a different, unambiguous representation of the full EOS token, e.g. <|EOS|> (which should only appear in the output of the LLM and never be sent as input by the user/inference software, which also keeps it unaffected by repetition penalty).

    And if we also need BOS, why not use <|BOS|> explicitly?

  • <|startofthought|> / <|endofthought|>: I'm not sure this is the right way to make thoughts special, especially compared to feelings/emotions, actions, etc. I'd rather see a special tag followed by an arbitrary string, making the tag flexible without limiting the special options, or a whole set of special tags for other useful options.
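To illustrate the division of labor described above, here is a minimal Python sketch of an inference loop's side of the contract. The token names follow the suggestions in this thread, and the helper names are hypothetical:

```python
# Minimal sketch of how inference software might handle the proposed tokens.
# <|BOS|> is prepended by the software, never typed by the user; <|EOS|> is
# only ever *emitted* by the model as a stop signal, never sent as input.
# Token names follow the suggestions above; helper names are hypothetical.

BOS = "<|BOS|>"
EOS = "<|EOS|>"
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def build_prompt(messages):
    """Render (role, text) pairs into a prompt; BOS is added automatically."""
    parts = [BOS]
    for role, text in messages:
        parts.append(f"{IM_START}{role} {text}{IM_END}\n")
    return "".join(parts)

def is_done(generated_token):
    """The inference loop stops as soon as the model emits the EOS token."""
    return generated_token == EOS

print(build_prompt([("user", "Hello there, AI.")]))
# <|BOS|><|im_start|>user Hello there, AI.<|im_end|>
```

Because <|EOS|> never occurs in the input, a repetition penalty applied over the prompt can never suppress it.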

Message Structure

  • <|im_end|> should follow the output message directly instead of being separated from it by a newline. Newlines are individual tokens, so they count against the max context limit, and with most inference software they're subject to repetition penalty. Most importantly, IMHO, they add nothing of value, only cost. (The EOS token, on the other hand, can - and maybe should - be separated by a line break.)

  • role: The term assistant imposes a predefined role on the AI. A more neutral term like char (character, commonly used by inference software) conserves tokens and lines up with user in fixed-width fonts.

  • name: A name after = probably tokenizes differently from the usual token that follows whitespace. It also costs two extra tokens. I'd rather follow the common approach of prefixing the message with name: (the actual name of the char/user), which only costs one extra token. As with the role, this reduces token usage, which is crucial in long chats.

  • message_content: Considering token economy, replacing line breaks with spaces could be beneficial, although this might reduce human readability. The choice of line break characters (CRLF vs. LF) should be standardized to avoid inconsistent tokenization. If applicable, I prefer pure LF, the same default Git uses internally.

Also consider an output format specifier, e.g. JSON or YAML, that could be used to clearly specify which format the response should use.
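The message-structure points above (no newline before <|im_end|>, char instead of assistant, a name: prefix) can be sketched as a small rendering helper. The actual token savings depend on the real tokenizer, so this only demonstrates the string format; render_message is a hypothetical helper name:

```python
def render_message(role, text, name=None):
    """Render one message in the suggested format:
    - role is 'user', 'char', or 'system' (not 'assistant'),
    - an optional speaker name is prefixed as 'Name: ' (one extra token
      with typical tokenizers, vs. two for a name= attribute),
    - <|im_end|> follows the text directly, with no newline before it."""
    prefix = f"{name}: " if name else ""
    return f"<|im_start|>{role} {prefix}{text}<|im_end|>"

print(render_message("user", "Hello there, AI.", name="Wolfram"))
# <|im_start|>user Wolfram: Hello there, AI.<|im_end|>
```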

Thought Structure

Good idea! Also consider emotions and actions; those are another layer that could deserve its own tags (or a general tag with a specific qualifier). We should support the AI outputting emotional states and real or simulated actions in a structure independent of the actual message, e.g. so TTS can use the emotion as a generation parameter for how to speak, without saying the emotional state out loud. The same goes for actions, which are otherwise often asterisk-delimited but would be more useful and less ambiguous in a clearly defined format (and for video generation/VR/robot control, they could later also be cleanly separated from the actual written/spoken text). That's why I suggest <|start_of|>… / <|end_of|>….
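To show how a client could consume such tagged spans, here is a hedged sketch of a parser that pulls the qualified blocks out of a generated message, leaving only the spoken text (the tag syntax follows the <|start_of|>/<|end_of|> suggestion above; split_message is a hypothetical helper):

```python
import re

# Extract <|start_of|>qualifier ... <|end_of|>qualifier spans (feelings,
# actions, thoughts) from a generated message, so e.g. a TTS engine can use
# the feeling as a speaking-style parameter without reading it aloud.
TAG = re.compile(r"<\|start_of\|>(\w+) (.*?)<\|end_of\|>\1\s*", re.DOTALL)

def split_message(text):
    """Return (spoken_text, {qualifier: [payloads]}) for one generated message."""
    spans = {}
    for qualifier, payload in TAG.findall(text):
        spans.setdefault(qualifier, []).append(payload)
    spoken = TAG.sub("", text).strip()
    return spoken, spans

msg = "<|start_of|>feeling happy<|end_of|>feeling Hi Wolfram. Nice to meet you."
print(split_message(msg))
# ('Hi Wolfram. Nice to meet you.', {'feeling': ['happy']})
```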

Fill-in-the-Middle Tasks

We could get rid of <|fim_middle|>, since insertion will always happen after the prefix and before the suffix; no extra tag is needed, as the insertion position is clear without it.
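A quick sketch of that round trip: without a <|fim_middle|> tag, the prompt is just prefix-tag, prefix, suffix-tag, suffix, and the model's completion is spliced between prefix and suffix (helper names are hypothetical):

```python
# Fill-in-the-middle round trip without a <|fim_middle|> tag: the insertion
# point is implied - after the prefix, before the suffix.
FIM_PREFIX, FIM_SUFFIX = "<|fim_prefix|>", "<|fim_suffix|>"

def fim_prompt(prefix, suffix):
    """Build the FIM prompt sent to the model."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}"

def fim_assemble(prefix, suffix, completion):
    """Splice the model's completion back into the document."""
    return prefix + completion + suffix

prefix = "The capital of France is "
suffix = ", which is known for its famous Eiffel Tower."
print(fim_assemble(prefix, suffix, "Paris"))
# The capital of France is Paris, which is known for its famous Eiffel Tower.
```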

Multi-File Sequences

Introducing optional filenames and clear <|file_start|> / <|file_end|> markers (instead of just separators) could streamline the handling of multiple file inputs, ensuring clear demarcation of text and file content within the same prompt. That way we can have text (that's not part of any file) before or after the files. The end tag could appear on the same line as the last line of the file (if the file has no trailing line break) or after a newline (if it does), so even that could be reflected unambiguously in the prompt.
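A minimal sketch of that framing, assuming the <|file_start|>/<|file_end|> markers proposed above (the helper names are hypothetical): because the end tag directly follows the file's last byte, the file's exact content - trailing newline or not - survives the round trip.

```python
# Wrap and unwrap one file in the proposed <|file_start|>/<|file_end|> framing.
# The end tag sits on the last content line when the file has no trailing
# newline, and on its own line when it does.
def wrap_file(filename, content):
    return f"<|file_start|>{filename}\n{content}<|file_end|>\n"

def unwrap_file(wrapped):
    """Recover (filename, exact content) from one wrapped file."""
    body = wrapped[len("<|file_start|>"):]
    filename, rest = body.split("\n", 1)
    content = rest[:-len("<|file_end|>\n")]
    return filename, content

# Both variants round-trip byte-exactly:
assert unwrap_file(wrap_file("a.txt", "no trailing newline")) == ("a.txt", "no trailing newline")
assert unwrap_file(wrap_file("b.txt", "trailing newline\n")) == ("b.txt", "trailing newline\n")
```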

Examples

Here's how we'd write the original examples with my suggested changes:

Example conversation:

<|im_start|>user Hello there, AI.<|im_end|>
<|im_start|>char Hi. Nice to meet you.<|im_end|>
<|EOS|>

Example conversation with speaker name:

<|im_start|>user Wolfram: Hello there, AI.<|im_end|>
<|im_start|>char Amy: Hi Wolfram. Nice to meet you.<|im_end|>
<|EOS|>

Example fill-in-the-middle task:

<|fim_prefix|>The capital of France is <|fim_suffix|>, which is known for its famous Eiffel Tower.

Example with thought block:

<|im_start|>user What is 17 * 34?<|im_end|>
<|im_start|>char <|start_of|>thought To multiply 17 by 34, we can break it down:
17 * 34 = 17 * (30 + 4)
        = (17 * 30) + (17 * 4)
        = 510 + 68
        = 578<|end_of|>thought
17 * 34 = 578.<|im_end|>
<|EOS|>

Example multi-file sequence:

<|file_start|>1st file.txt
This is the content from the first file.
<|file_end|>
<|file_start|>2nd file.txt
This is the content from the second file.
And this is more content from the second file.
<|file_end|>
<|file_start|>3rd file.txt
Finally, this is the content from the third file.
<|file_end|>

New: Multi character chat example:

<|im_start|>system <|start_of|>action Amy appears and greets Wolfram.<|end_of|>action<|im_end|>
<|im_start|>user Wolfram: Hello there, AI.<|im_end|>
<|im_start|>char Amy: <|start_of|>feeling happy<|end_of|>feeling Hi Wolfram. Nice to meet you.<|im_end|>
<|im_start|>user Wolfram: And who's that?<|im_end|>
<|im_start|>char Amy: That's my sister, Ivy.<|im_end|>
<|im_start|>char Ivy: <|start_of|>feeling curious<|end_of|>feeling Hi Wolfram. How are you?<|im_end|>
<|im_start|>user Wolfram: Oh, there's two of you?<|im_end|>
<|im_start|>char Ivy: Yes, of course, why not?<|im_end|>
<|im_start|>char Amy: Yeah, multi-user chats are fun! <|start_of|>action laughs<|end_of|>action<|im_end|>
<|EOS|>

Command R prompt template

Finally, as a big fan of the Prompting Command R document and its very useful additions to the prompt (e.g. the Safety Preamble and Style Guide), I'd most welcome such features in any model using the OpenChatML format.

Love the idea of the extensibility of the <|start_of|>thing / <|end_of|>thing markup. Couldn't it be applied to all the tags, though, and simplify everything? E.g.,

<|start|>message <|start|>system <|start|>action Amy appears and greets Wolfram.<|end|>action<|end|>message   
<|start|>file 1st file.txt <|end|>file

Using <|start|> and <|end|> with strings would cost us more tokens, though, as the combination couldn't tokenize as a single special token. Especially with tokens as frequent as these, we'd waste quite a few tokens.

It's also a drawback of my own proposal regarding thoughts, and it might actually be better to have more special tokens (one for thoughts, one for actions, one for emotions?) instead of a super-flexible one that is only used in a few situations but costs a lot of tokens (over the whole context). I don't yet know myself what would be better, so I'm just making suggestions and encouraging you to think about it.
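To make the trade-off concrete, here is a toy illustration, assuming a hypothetical tokenizer in which every <|...|> tag is a single special token while plain words like "thought" tokenize separately (real tokenizers differ; this only illustrates the cost direction, not actual counts):

```python
import re

# Toy tokenizer: each <|...|> tag counts as one special token; everything
# else splits on whitespace. Purely illustrative, not a real tokenizer.
def toy_tokenize(text):
    return [t for t in re.split(r"(<\|[^|]+\|>)|\s+", text) if t]

generic = toy_tokenize("<|start_of|>thought ... <|end_of|>thought")
dedicated = toy_tokenize("<|startofthought|> ... <|endofthought|>")
# The generic form pays for its qualifier strings on every use:
assert len(generic) > len(dedicated)
```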

Changed my mind on the generic <|start|> and <|end|> tags:

I thought having a universal tag to start and end special messages like thoughts, emotions, actions, etc. would make sense. However, after seeing the number of special tokens Meta reserved in the Llama 3 Instruct tokenizer, I now think it would be better to just have a bunch of special tokens instead of universal ones plus regular strings. That would save in-context tokens, prevent the string from influencing output in unintended ways, and be more readable.

Yes, superficially it might seem like a good idea to have "universal" tokens, but reusing special tokens for different modes and actions would likely increase confusion for the model, resulting in a decreased ability to follow instructions and degrading its performance, especially post-quantization and with large context.