getalp/Flaubert

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 237414383616 bytes. Error code 12 (Cannot allocate memory)

keloemma opened this issue · 3 comments

Environment info

  • transformers version: 2.5.1
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Model I am using (FlauBERT):

The problem arises when trying to produce features with the model; the generated output causes the system to run out of memory.

  • the official example scripts: (I did not change much; it is pretty close to the original)
import torch
from transformers import FlaubertModel, FlaubertTokenizer
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased' 

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
  • My own modified scripts: (give details below)
import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def get_flaubert_layer(texte):

    modelname = "flaubert-base-uncased"
    path = './flau/flaubert-base-unc/'

    flaubert = FlaubertModel.from_pretrained(path)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
    # Tokenize every sentence in the pandas Series, truncating to 512 tokens
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))
    # Pad all sequences to the length of the longest one
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    token_ids = torch.tensor(padded)
    with torch.no_grad():
        # Keep only the hidden state of the first token of the last layer for each sentence
        last_layer = flaubert(token_ids)[0][:, 0, :].numpy()

    return last_layer, modelname

The task I am working on is:

  • Producing vectors/features from a language model and passing them to other classifiers

To reproduce

Steps to reproduce the behavior:

  1. Install the transformers library, along with scikit-learn, pandas, numpy, and PyTorch
  2. Run the last lines of code below
import os
import pandas as pd

# Reading the file
filename = "corpus"
sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name=0)
data_id = sentences.identifiant
print("Total phrases: ", len(data_id))
data = sentences.sent
label = sentences.etiquette
emb, mdlname = get_flaubert_layer(data)  # corpus is a dataframe of approximately 40 000 lines

Apparently this line produces something huge which takes a lot of memory:
last_layer = flaubert(token_ids)[0][:,0,:].numpy()

I would have expected it to run, but I think passing the whole dataset to the model at once is causing the system to break. I wanted to know whether it is possible to tell the model to process the dataset maybe 500 or 1,000 lines at a time instead of passing everything in one go. I know there is a batch_size parameter, but I am not training a model, merely using it to produce embeddings as input for other classifiers.
Do you perhaps know how to modify the batch size so that the whole dataset is not processed at once? I am not really familiar with this type of architecture. In the example they just encode one single sentence, but in my case I load a whole dataset (dataframe).

My expectation is for the model to process all the sentences and produce the vectors I need for the classification task.

I found the solution.

Could you indicate what the problem was? (For people who might run into the same problem.) Thanks in advance.

It was a problem linked to insufficient memory when running the model on the whole dataset at once. I passed small batches to the model to avoid this error: I created a loop over i in range(0, len(padded), batch_size), passed padded[i:i + batch_size] to the model, and then concatenated the predictions back together.
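
For reference, here is a minimal sketch of that batching loop (my own illustration, not code from the issue), reusing the padded array and flaubert model from the snippet above; batch_size = 64 is an arbitrary value to adjust to the available memory:

import numpy as np
import torch

batch_size = 64  # arbitrary choice; pick a value that fits in memory
features = []
for i in range(0, len(padded), batch_size):
    # Take one slice of the padded token matrix at a time
    batch = torch.tensor(padded[i:i + batch_size])
    with torch.no_grad():
        # First hidden state of the last layer for each sentence in the batch
        out = flaubert(batch)[0][:, 0, :].numpy()
    features.append(out)

# Concatenate the per-batch embeddings back into a single array
emb = np.concatenate(features, axis=0)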