Multi-threaded mode?
cvinker opened this issue · 2 comments
cvinker commented
import torch, gc
from transformers import AutoTokenizer, OPTForCausalLM

# "galactica-30b" must resolve to a local checkpoint directory or the Hub ID (facebook/galactica-30b)
tokenizer = AutoTokenizer.from_pretrained("galactica-30b")
tokenizer.pad_token_id = 1
tokenizer.padding_side = 'left'
tokenizer.model_max_length = 2020
model = OPTForCausalLM.from_pretrained("galactica-30b")

input_text = """# Scientific article.
title: Purpose of Humanity's continued existence alive.
# Introduction
"""
input_ids = tokenizer(input_text, return_tensors="pt", padding='max_length').input_ids

outputs = model.generate(input_ids,
                         max_new_tokens=1000,
                         do_sample=True,
                         temperature=0.7,
                         top_k=25,
                         top_p=0.9,
                         no_repeat_ngram_size=10,
                         early_stopping=True)

# skip_special_tokens drops the <pad> tokens; str.lstrip('<pad>') strips characters, not the substring
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

gc.collect()
torch.cuda.empty_cache()  # torch has no top-level empty_cache(); only the CUDA allocator exposes one
When I run this, I can see that it loads the model into RAM, but it seems to be using only one thread. The output is a wall of warnings about various 'decoder.layers.xx.bias' weights, followed by: "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."
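For reference, a minimal sketch of how to check and raise the number of CPU threads PyTorch uses, assuming the single-thread behaviour comes from the thread settings rather than from the model itself:

import os
import torch

# Inspect the current CPU thread pools.
print("intra-op threads:", torch.get_num_threads())          # parallelism inside one op (matmuls, etc.)
print("inter-op threads:", torch.get_num_interop_threads())  # parallelism across independent ops

# Raise the intra-op pool to the number of cores if it is stuck at 1.
# (Do this before any heavy work starts.)
torch.set_num_threads(os.cpu_count() or 1)

The environment variables OMP_NUM_THREADS / MKL_NUM_THREADS can also pin these pools from outside the script.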
cvinker commented
OK, I was able to get it to work properly with the 6.7b model. I don't think I need the torch.empty_cache() call.
Also, it does seem to be using multi-threading.
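If the goal of those last two lines was just to free memory between runs, a minimal sketch (assuming the model may or may not be on a GPU) would be:

import gc
import torch

# Drop the Python references first so the tensors become collectable...
del model, outputs, input_ids
gc.collect()

# ...then release cached allocator blocks back to the driver, only if CUDA is actually in use.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

On a pure CPU run the gc.collect() alone is enough; the CUDA cache only matters when the model lives on a GPU.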
cvinker commented