Translation script stops after a few minutes
Opened this issue · 7 comments
I have not used this script, but OpenAI has been having issues with their API service over the last few days, including today.
No, right now I wait until the AlpacaDataCleaned progresses further, then I'll try again.
Exceptions blocking task execution are the most probable cause of this issue. The rate limits are per account and typically very small, around 90,000 tokens/minute. Start with 10 threads. I had this problem myself and traced it.
You can add simple exception handling to translate_text. I'm not sure whether rate-limited requests are billed in OpenAI's API. I suggest testing the translation script with a smaller chunk first, 0-100 for example.
Here is what you need to catch: openai.error.RateLimitError
def translate_text(value):
    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Translate the following text to Finnish: '{value}'"},
            ],
            max_tokens=1024,
            temperature=0,
        )
        return response.choices[0]["message"]["content"].strip()
    except openai.error.RateLimitError as e:
        print("Rate limit error: ", e)
        return None  # caller gets None instead of a translation
This will lead to None being returned instead of the translated string, so you may prefer rewriting the code to retry the request, or to skip appending the whole prompt if one of its instructions is None.
But please check OpenAI's billing terms for how rate-limited requests are handled before adding retries; otherwise this may become very expensive.
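One way to respect that concern is to cap the number of retries, so a persistently rate-limited prompt cannot run up the bill indefinitely. A minimal, library-agnostic sketch (the function name, the RuntimeError stand-in for openai.error.RateLimitError, and the delay values are illustrative, not from the script):

```python
import time

def with_capped_retries(call, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` with exponential backoff, giving up after max_tries.

    `call` stands in for the actual openai.ChatCompletion.create request;
    RuntimeError stands in for openai.error.RateLimitError.
    """
    for attempt in range(max_tries):
        try:
            return call()
        except RuntimeError:
            sleep(base_delay * 2 ** attempt)  # 1, 2, 4, 8, 16 s
    return None  # give up; caller can skip or log this prompt
```

Because the retry count is bounded, the worst-case billed attempts per prompt are known up front, unlike an unbounded backoff loop.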
Retrying the request with exponential backoff on exception, as per OpenAI's official documentation, is what I've implemented and am using now.
import backoff

...

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def translate_text(value):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Translate the following text to Finnish: '{value}'"},
        ],
        max_tokens=1024,
        temperature=0,
    )
    return response.choices[0]["message"]["content"].strip()
I'm being rate-limited so much that even with only 10 threads it takes ~5 minutes to translate 350 complete instruction prompts (system, user, and output). Insane.
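Those numbers can be sanity-checked against the per-account token budget mentioned earlier. A back-of-the-envelope calculation, using only the figures quoted in this thread:

```python
# Back-of-the-envelope throughput check using the figures quoted above.
TOKEN_LIMIT_PER_MIN = 90_000  # per-account rate limit mentioned earlier
PROMPTS = 350                 # prompts translated
MINUTES = 5                   # wall-clock time observed

prompts_per_min = PROMPTS / MINUTES                        # 70 prompts/minute
tokens_per_prompt = TOKEN_LIMIT_PER_MIN / prompts_per_min  # implied per-prompt budget

print(f"{prompts_per_min:.0f} prompts/min -> ~{tokens_per_prompt:.0f} tokens/prompt budget")
```

At roughly 1,300 tokens of budget per prompt round trip, the observed throughput sits close to the token cap, which suggests the rate limiter, not the thread count, is the bottleneck.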
Br;
Thanks for the analysis, very interesting. Maybe just paying for the Google Translate or DeepL API instead could be a better option?
Hi, I'm unfamiliar with these other translation services, but I'll surely look them up. I think if any of them have better rate limits by default, and are priced accordingly, they are worth it. For me, 1,000 alpaca-lora cleaned prompts cost around $0.75-1 with OpenAI while debugging this code, but the rate limit is killing the productivity I could otherwise get with threads.
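Extrapolating that per-1,000 figure to the full cleaned dataset (51,785 entries) gives a rough OpenAI budget. A quick estimate using only the numbers quoted here:

```python
# Rough cost extrapolation from the figures in this thread.
COST_PER_1000 = (0.75, 1.00)  # observed USD range per 1,000 prompts
TOTAL_PROMPTS = 51_785        # size of the cleaned Alpaca dataset

low, high = (c * TOTAL_PROMPTS / 1000 for c in COST_PER_1000)
print(f"Estimated total: ${low:.2f} - ${high:.2f}")
```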
I created a whole rewrite of the code that now translates all of the prompt object's string values, but not the keys, and is tuned for performance. It is still slow, 100-150 prompts/minute, due to spurious rate limits from OpenAI ("model busy"/"rate limit reached") which trigger the exponential backoff timer.
import json
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

import backoff
import openai

# from tqdm import tqdm
# Replace 'your_api_key' with your actual API key
openai.api_key = "your-key-here"
def backoff_hdlr(details):
    print(
        "Backing off {wait:0.1f} seconds after {tries} tries "
        "calling function {target} with args {args} and kwargs "
        "{kwargs}".format(**details)
    )
@backoff.on_exception(
    backoff.expo, openai.error.RateLimitError, max_tries=50, on_backoff=backoff_hdlr
)
def translate_text(value):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Translate the following JSON object's string values into Finnish, do not translate "
                "string keys. No English string values are accepted.",
            },
            {
                "role": "user",
                "content": json.dumps(value),
            },
        ],
        max_tokens=1024,
        temperature=0,
    )
    return response.choices[0]["message"]["content"].strip()
def translate_item(item):
    translated_item = translate_text(item)
    return translated_item
# Maximum number of parallel requests
MAX_PARALLEL_REQUESTS = 25
# Load the cleaned Alpaca dataset
with open("../data/alpaca_data_cleaned_archive.json", "r") as f:
    data = json.load(f)
lock = Lock()
start = 0
end = 51785
translated_data = []
data = data[start:end]
tasks_completed = 0
# simple progress indicator callback function
def progress_indicator(future):
    global tasks_completed
    # obtain the lock
    with lock:
        try:
            translated_data.append(json.loads(future.result()))
            # update the counter
            tasks_completed += 1
            # report progress
            print(
                f"{tasks_completed}/{len(futures)} completed, {len(futures) - tasks_completed} remain."
            )
            # if tasks_completed is divisible by 1000, save a snapshot
            if tasks_completed % 1000 == 0:
                # Save a snapshot of the translated data so far
                with open(f"translated_data_up_to_{start}_to_{end}.json", "w") as f:
                    json.dump(translated_data, f, ensure_ascii=False, indent=4)
                print(
                    f"Translation snapshot saved in 'translated_data_up_to_{start}_to_{end}.json'"
                )
        except Exception as e:
            print(f"exception: {e}")
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REQUESTS) as executor:
    futures = {executor.submit(translate_item, item): item for item in data}
    for future in futures:
        future.add_done_callback(progress_indicator)
# Finally, save the full translated data set
with open(f"translated_data_up_to_{start}_to_{end}.json", "w") as f:
    json.dump(translated_data, f, ensure_ascii=False, indent=4)
print(
    f"Translation complete. The translated data is saved in 'translated_data_up_to_{start}_to_{end}.json'"
)
DeepL API https://www.deepl.com/pro#developer
- Pay as you go (€20.00 per 1,000,000 characters)
--
❯ wc -c alpaca_data_cleaned_archive.json
22,680,910 alpaca_data_cleaned_archive.json
This is actually much more expensive :D
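The arithmetic behind that conclusion, using the character count above and DeepL's pay-as-you-go rate:

```python
# DeepL pay-as-you-go estimate from the numbers in this thread.
PRICE_PER_MILLION_CHARS = 20.00  # EUR
DATASET_CHARS = 22_680_910       # wc -c of alpaca_data_cleaned_archive.json

cost_eur = DATASET_CHARS / 1_000_000 * PRICE_PER_MILLION_CHARS
print(f"~€{cost_eur:.2f}")
```

Roughly €454 for the whole dataset, versus the ~$39-52 implied by the $0.75-1 per 1,000 prompts figure quoted earlier for OpenAI.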