replicate/replicate-python

request specific response length

eliot1785 opened this issue · 4 comments

Hi, I just started playing around with this. It seems pretty awesome, although speed may be an issue: it looks like it's making a round trip per word, or sometimes several per word. It seems like it would be faster if we could just say something like, "Give me 3 sentences in response" and have it return all 3 sentences at once. Is that something that is currently possible?

Thanks.

Hi @eliot1785. What model are you using and how are you calling it? If you share some code I'd be happy to take a look.

I just ran this:

import replicate
output = replicate.run(
    "meta/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
    input={"prompt": "What is property casualty insurance?"}
)
# The meta/llama-2-70b-chat model can stream output as it's running.
# replicate.run returns an iterator, and you can iterate over that output.

sentences = 0
for item in output:
    print(item, end='')
    if '.' in item:
        sentences += 1
    if sentences == 3:
        break
    

@eliot1785 Thanks for clarifying. That model is generating output token by token. The latency introduced by the network requests should be negligible, so returning all of the output at once wouldn't make it finish any sooner.

A few tips for solving your use case:

  • Llama 2 and other models have a max_new_tokens input that you can tune to limit how many tokens are generated. When you break out of the loop on the client, the prediction keeps running on the server until it finishes, so tuning this parameter also helps control cost. (See the first sketch after this list.)
  • Checking for "." is a fragile way to detect the end of a sentence and will produce false positives on tokens like "Dr.". Instead, you can use spaCy or another NLP library to split the output into sentences more accurately. (See the second sketch after this list.)
  • You can also use prompt engineering to shape how results are generated, for example by prompting with "Answer in three sentences or less."
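Here's a rough sketch of the first and third tips combined. The model version and prompt are from your snippet above; the max_new_tokens value of 128 is just an illustrative guess, so tune it for your typical answer length:

import replicate

output = replicate.run(
    "meta/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
    input={
        # Ask for a short answer in the prompt itself...
        "prompt": "What is property casualty insurance? Answer in three sentences or less.",
        # ...and cap generation so the prediction can't run much longer than that.
        "max_new_tokens": 128,  # illustrative value, tune for your use case
    },
)

for item in output:
    print(item, end='')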
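And here's a rough sketch of the second tip, assuming spaCy and its small English pipeline are installed (pip install spacy, then python -m spacy download en_core_web_sm):

import replicate
import spacy

nlp = spacy.load("en_core_web_sm")

output = replicate.run(
    "meta/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
    input={"prompt": "What is property casualty insurance?"},
)

buffer = ""
sents = []
for item in output:
    buffer += item
    # Re-segmenting the whole buffer on every chunk is wasteful but fine for a sketch;
    # spaCy handles abbreviations like "Dr." that a plain "." check would miscount.
    sents = list(nlp(buffer).sents)
    if len(sents) > 3:
        # A fourth sentence has started, so the first three are complete.
        break

print(' '.join(sent.text.strip() for sent in sents[:3]))

Keep in mind the prediction keeps running after you break out of the loop, so combining this with max_new_tokens is still worthwhile.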

@eliot1785 I'm going to close this issue for now. Please let me know if you have any more questions. Thanks!