RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 237414383616 bytes. Error code 12 (Cannot allocate memory)
keloemma opened this issue · 3 comments
Environment info
- transformers version: 2.5.1
- Platform: Linux
- Python version: 3.7
- PyTorch version (GPU?): 1.4
- Tensorflow version (GPU?):
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Model I am using: FlauBERT
The problem arises when producing features with the model: the generated output causes the system to run out of memory.
- The official example script (I did not change much; it is pretty close to the original):
import torch
from transformers import FlaubertModel, FlaubertTokenizer
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased',
# 'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'
# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones
sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768]) -> (batch size x number of tokens x embedding dimension)
# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
- My own modified script:
import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def get_flaubert_layer(texte):
    modelname = "flaubert-base-uncased"
    path = './flau/flaubert-base-unc/'
    flaubert = FlaubertModel.from_pretrained(path)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
    # Tokenize every sentence in the pandas Series, truncating at 512 tokens
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))
    # Pad all token lists to the length of the longest one
    max_len = max(len(i) for i in tokenized.values)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    token_ids = torch.tensor(padded)
    with torch.no_grad():
        # [CLS] embedding of the last hidden layer, for every sentence at once
        last_layer = flaubert(token_ids)[0][:, 0, :].numpy()
    return last_layer, modelname
The task I am working on is:
- Producing vectors/features from a language model and passing them to other classifiers
To reproduce
Steps to reproduce the behavior:
- Install the transformers library along with scikit-learn, pandas, numpy, and PyTorch
- Run the last lines of code:
import os
import pandas as pd

# Read the corpus (a dataframe of approximately 40,000 lines);
# root is the directory containing the corpus file
filename = "corpus"
sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name=0)
data_id = sentences.identifiant
print("Total phrases: ", len(data_id))
data = sentences.sent
label = sentences.etiquette
emb, mdlname = get_flaubert_layer(data)
Apparently this line produces something huge, which takes a lot of memory:
last_layer = flaubert(token_ids)[0][:,0,:].numpy()
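That is plausible: with all ~40,000 padded sequences in a single batch, each self-attention layer materializes a score tensor of shape (batch, heads, seq_len, seq_len). A back-of-the-envelope sketch of its size (the head count is FlauBERT-base's 12; the padded length of 352 is a hypothetical value chosen to match the error, since the real max_len depends on the corpus):

# Rough size of one layer's attention-score tensor in float32.
# All numbers except n_heads are assumptions about this corpus.
n_sentences = 40_000   # rows in the dataframe, all in one batch
n_heads = 12           # flaubert_base attention heads
seq_len = 352          # hypothetical padded length (max_len)
bytes_per_float = 4

scores = n_sentences * n_heads * seq_len * seq_len * bytes_per_float
print(f"{scores / 1e9:.0f} GB")  # ~238 GB, the order of the 237414383616-byte request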
I would have expected it to run, but I think passing the whole dataset to the model at once is what breaks the system. So I wanted to know whether it is possible to tell the model to process the dataset maybe 500 or 1,000 lines at a time rather than all of it. I know there is a batch_size parameter, but since I am not training a model, merely using it to produce embeddings as input for other classifiers:
Do you know how to modify the batch size so the whole dataset is not processed at once? I am not really familiar with this type of architecture. In the example, a single sentence is encoded, but in my case I load a whole dataset (a dataframe).
My expectation is for the model to process all the sentences and then produce the vectors I need for the classification task.
I found the solution.
Could you indicate what the problem was? (For people who might run into the same issue.) Thanks in advance.
It was a problem linked to insufficient memory when running the model over the whole dataset at once. I passed small batches to the model to avoid the error: I created a loop over i in range(0, len(padded), batch_size), passed padded[i:i + batch_size] to the model, and concatenated the predictions back together.
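A minimal sketch of that loop, assuming the padded array and flaubert model from the function above (batch_size and features are illustrative names):

import numpy as np
import torch

batch_size = 500  # process 500 sentences at a time instead of all 40,000
features = []
with torch.no_grad():
    for i in range(0, len(padded), batch_size):
        batch = torch.tensor(padded[i:i + batch_size])
        # [CLS] embedding of the last hidden layer for this slice of the corpus
        features.append(flaubert(batch)[0][:, 0, :].numpy())
emb = np.concatenate(features, axis=0)  # shape: (number of sentences, 768)

Like the original snippet, this does not pass an attention mask, so the zero padding still takes part in attention; if that matters for your classifiers, passing attention_mask to the model would likely be the next refinement.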