Training on 500k data
jawadSajid opened this issue · 1 comment
Hi,
I am training Dolly on 500k examples with pythia-1.4b; I have previously trained it on around 20k examples. With the 500k dataset, I get an OOM error after a certain amount of training time.
These are my training args:
from transformers import TrainingArguments

training_args = TrainingArguments(
output_dir="path",
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
fp16=True,
bf16=False,
learning_rate=5e-5,
gradient_accumulation_steps=2,
num_train_epochs=1,
logging_dir="./path/runs",
logging_strategy="steps",
logging_steps=20,
evaluation_strategy="steps",
save_strategy="steps",
save_total_limit=10,
load_best_model_at_end=True,
dataloader_num_workers=8,
)
I load my dataset as follows:
from datasets import load_dataset

dataset = load_dataset("json", data_files=path_or_dataset, num_proc=4)
dataset = dataset['train']
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0
Python 3.8.10, pip freeze:
absl-py==1.4.0
accelerate==0.19.0
aiohttp==3.8.4
aiosignal==1.3.1
async-timeout==4.0.2
attrs==23.1.0
bitsandbytes==0.38.1
cachetools==5.3.0
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.3
dataclasses-json==0.5.7
datasets==2.12.0
dill==0.3.6
filelock==3.12.0
frozenlist==1.3.3
fsspec==2023.5.0
google-auth==2.18.0
google-auth-oauthlib==1.0.0
greenlet==2.0.2
grpcio==1.54.2
hjson==3.1.0
huggingface-hub==0.14.1
idna==3.4
importlib-metadata==6.6.0
Jinja2==3.1.2
lit==16.0.3
Markdown==3.4.3
MarkupSafe==2.1.2
marshmallow==3.19.0
marshmallow-enum==1.5.1
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
networkx==3.1
ninja==1.11.1
numexpr==2.8.4
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
openapi-schema-pydantic==1.2.4
packaging==23.1
pandas==2.0.1
protobuf==4.23.0
psutil==5.9.5
py-cpuinfo==9.0.0
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.7
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
regex==2023.5.5
requests==2.30.0
requests-oauthlib==1.3.1
responses==0.18.0
rsa==4.9
six==1.16.0
SQLAlchemy==2.0.13
sympy==1.12
tenacity==8.2.2
tensorboard==2.13.0
tensorboard-data-server==0.7.0
tokenizers==0.13.3
torch==2.0.1
tqdm==4.65.0
transformers==4.29.1
triton==2.0.0
typing-extensions==4.5.0
typing-inspect==0.8.0
tzdata==2023.3
urllib3==1.26.15
Werkzeug==2.3.4
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0
GPUs
8× Tesla K80, 12 GB each.
Traceback:
File "trainer.py", line 336, in <module>
main()
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "trainer.py", line 328, in main
train(**kwargs)
File "trainer.py", line 282, in train
trainer.train()
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/transformers/trainer.py", line 2745, in training_step
self.scaler.scale(loss).backward()
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 5; 11.17 GiB total capacity; 7.17 GiB already allocated; 948.25 MiB free; 9.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
What could be going wrong? It seems like the dataloader keeps consuming GPU memory and never frees it, which ultimately throws the OOM.
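For reference, this is how I understand the max_split_size_mb hint at the end of the traceback would be applied; I have not tried it yet, and the 128 MB value below is only a guess:

import os

# Untried setting: it must be in place before torch initializes CUDA,
# so set it before importing torch (or export it in the shell before launching).
# The 128 MB split size is an illustrative value, not something I have validated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch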
You have 12 GB GPUs, which is small, but then again the model isn't that big. It looks like you aren't using the code, model, or data in this repo, so I am not sure this is the right place to ask. Typical advice is to lower your batch size; see also the training settings in the README that help lower memory requirements.
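As a rough, untested sketch (the exact flags in this repo's README may differ), the usual first knobs are a smaller per-device batch, more gradient accumulation, and gradient checkpointing, e.g.:

from transformers import TrainingArguments

# Illustrative values only, not tuned for your setup: trades throughput for memory.
training_args = TrainingArguments(
    output_dir="path",
    per_device_train_batch_size=1,   # halve the per-GPU batch
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # keeps the effective batch size the same (1 x 4 vs 2 x 2)
    gradient_checkpointing=True,     # recompute activations in backward to save memory
    fp16=True,
    learning_rate=5e-5,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_total_limit=10,
    load_best_model_at_end=True,
    dataloader_num_workers=8,
)

Gradient checkpointing saves a large fraction of activation memory at the cost of a slower backward pass, which is usually an acceptable trade on 12 GB cards.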