replicate/replicate-python

Llama 3 streaming repeats the previous request's first token.

Closed this issue · 8 comments

Hi! I'm running into a problem where, when streaming, each subsequent request's response starts by repeating the first token of every previous response. The prompt structure follows Meta's Llama 3 documentation. Could you explain why this is happening?

The output of a simple chat example looks like this:

The model name is meta/meta-llama-3-70b-instruct

You: Hi!
Assistant: Hi! How can I help you today?

You: Recommend me a Hemingway novel, please.
Assistant: Hi
I'd recommend "The Old Man and the Sea". It's a classic, concise, and powerful novel that showcases Hemingway's unique writing style.

You: I read it, please recommend something else.
Assistant: Hi
I
How about "A Farewell to Arms"? It's a romantic and tragic novel set during WWI, and it's considered one of Hemingway's best works.

You: It's great! Thank you! Bye!
Assistant: Hi
I
How
You're welcome! I'm glad you enjoyed the recommendation. Have a great day and happy reading! Bye!

Example code:

import os
from replicate.client import Client

replicate_api_key = os.getenv("REPLICATE_API_TOKEN", 'EMPTY')
replicate_model = os.getenv('REPLICATE_MODEL', 'meta/meta-llama-3-70b-instruct')
replicate_client = Client(api_token=replicate_api_key)

SYSTEM_PROMPT = 'You are a helpful assistant. Answer briefly!'
MESSAGES = []


def gen_llama3_prompt(sys_prompt=None, messages=None):
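    """Build a Llama 3 chat-template prompt from the system prompt and message history."""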
    sys_prompt = '' if sys_prompt is None else sys_prompt
    messages = [] if messages is None else messages
    _result = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{sys_prompt}<|eot_id|>"
    for m in messages:
        if m['role'] == 'user':
            _result += f'<|start_header_id|>user<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
        elif m['role'] == 'assistant':
            _result += f'<|start_header_id|>assistant<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
    _result += '<|start_header_id|>assistant<|end_header_id|>\n\n'
    return _result


def print_answer(query=''):
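    """Send the conversation to the model, stream the answer, and record it in MESSAGES."""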
    message = {'role': 'user', 'content': query}
    answer = ''
    MESSAGES.append(message)
    for event in replicate_client.stream(
            replicate_model,  # use the configured model rather than a hardcoded name
            input={
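                # Near-zero top_p and temperature make generation effectively deterministic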
                "top_p": 1e-5,
                "prompt": gen_llama3_prompt(SYSTEM_PROMPT, MESSAGES),
                "max_tokens": 512,
                "min_tokens": 0,
                "temperature": 1e-6
            }):
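        # Each event is a chunk of streamed text; str() extracts its payload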
        token = str(event)
        answer += token
        print(token, end='')
    message = {'role': 'assistant', 'content': answer}
    MESSAGES.append(message)


if __name__ == '__main__':
    print(f'Model name is {replicate_model}')
    while True:
        q = input('\nYou: ')
        print('Assistant: ', end='')
        print_answer(q)
        if 'bye' in q.lower():
            break

Thanks for your help!

Hi @mikutsky. Thanks for reporting this. Can you share any of the predictions for these? (Go to your replicate.com dashboard and look under Predictions.) Seeing those would help us tell whether the problem is in the model or in the client library.

Hi, I have the same issue.

@Gusakovskyi @mikutsky We've confirmed that there's an issue with stop sequences for meta/meta-llama-3-70b-instruct, and we're working on a fix.

It looks like a client library problem. I'm providing the info for the second query, because the later queries accumulate the mistakes in the prompt.

Everything looks correct on the dashboard:
[Screenshot: the prediction on the Replicate dashboard]

Here is the prompt for the second query; it is still correct:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant. Answer briefly!<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi! How can I help you today?<|eot_id|><|start_header_id|>user<|end_header_id|>

I read it, please recommend something else.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

However, the console output contains the extra token 'Hi':

You: Hi!
Assistant: Hi! How can I help you today?

You: I read it, please recommend something else.
Assistant: Hi
I'd be happy to! However, I need a bit more information. What type of content are you in the mood for? A book, article, podcast, or something else?

@mikutsky We just pushed a new build of the model, which should address the stop sequence problem. Please give your client code another try and let me know if that's working for you now.

If not, could you please try calling replicate.stream in isolation? I'd like to rule out the use of input() and access to mutable state in the loop, even though that should be running synchronously and not be a problem.
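
For example, something like this minimal sketch (the prompt string here is just illustrative, and it assumes REPLICATE_API_TOKEN is set in the environment):

import replicate

# A single isolated streaming call: no input(), no shared MESSAGES list
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={
        "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "max_tokens": 64,
    },
):
    print(str(event), end="")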

Actually, I'm able to reproduce this in isolation, so it does appear to be an issue with the client. Working on a fix now.

@mikutsky @Gusakovskyi Thanks again for reporting. This should be fixed by 0.25.2.

Please let me know if you continue to see this behavior.
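
To pick up the fix, upgrade the package:

pip install --upgrade "replicate>=0.25.2"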

Thanks a lot! It works!