databrickslabs/dolly

Training on 500k data

jawadSajid opened this issue · 1 comment

Hi,

I am training Dolly (pythia-1.4b) on about 500k examples; I previously trained it on around 20k examples without issue.
With the 500k dataset, I get an OOM error after training has been running for some time.

These are my training args:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="path",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        fp16=True,
        bf16=False,
        learning_rate=5e-5,
        gradient_accumulation_steps=2,
        num_train_epochs=1,
        logging_dir="./path/runs",
        logging_strategy="steps",
        logging_steps=20,
        evaluation_strategy="steps",
        save_strategy="steps",
        save_total_limit=10,
        load_best_model_at_end=True,
        dataloader_num_workers=8,
    )

I load my dataset as follows:

    from datasets import load_dataset

    dataset = load_dataset("json", data_files=path_or_dataset, num_proc=4)
    dataset = dataset["train"]

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

Python 3.8.10, pip freeze:

absl-py==1.4.0
accelerate==0.19.0
aiohttp==3.8.4
aiosignal==1.3.1
async-timeout==4.0.2
attrs==23.1.0
bitsandbytes==0.38.1
cachetools==5.3.0
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.3
dataclasses-json==0.5.7
datasets==2.12.0
dill==0.3.6
filelock==3.12.0
frozenlist==1.3.3
fsspec==2023.5.0
google-auth==2.18.0
google-auth-oauthlib==1.0.0
greenlet==2.0.2
grpcio==1.54.2
hjson==3.1.0
huggingface-hub==0.14.1
idna==3.4
importlib-metadata==6.6.0
Jinja2==3.1.2
lit==16.0.3
Markdown==3.4.3
MarkupSafe==2.1.2
marshmallow==3.19.0
marshmallow-enum==1.5.1
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
networkx==3.1
ninja==1.11.1
numexpr==2.8.4
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
openapi-schema-pydantic==1.2.4
packaging==23.1
pandas==2.0.1
protobuf==4.23.0
psutil==5.9.5
py-cpuinfo==9.0.0
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.7
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
regex==2023.5.5
requests==2.30.0
requests-oauthlib==1.3.1
responses==0.18.0
rsa==4.9
six==1.16.0
SQLAlchemy==2.0.13
sympy==1.12
tenacity==8.2.2
tensorboard==2.13.0
tensorboard-data-server==0.7.0
tokenizers==0.13.3
torch==2.0.1
tqdm==4.65.0
transformers==4.29.1
triton==2.0.0
typing-extensions==4.5.0
typing-inspect==0.8.0
tzdata==2023.3
urllib3==1.26.15
Werkzeug==2.3.4
xxhash==3.2.0
yarl==1.9.2
zipp==3.15.0

GPUs

8× Tesla K80, 12 GB each.

Traceback:

  File "trainer.py", line 336, in <module>
    main()
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "trainer.py", line 328, in main
    train(**kwargs)
  File "trainer.py", line 282, in train
    trainer.train()
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/transformers/trainer.py", line 2745, in training_step
    self.scaler.scale(loss).backward()
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/mnt/dsc-nfs-1/jawad-sajid/DOLLY/dolly-env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 5; 11.17 GiB total capacity; 7.17 GiB already allocated; 948.25 MiB free; 9.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

What could be going wrong? It seems like the dataloader keeps consuming GPU memory and never frees it, which ultimately triggers the OOM.
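Per the hint in the error message itself, one thing I could try is tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch (the 128 MiB split size is just an illustrative value, not something from this repo):

    import os

    # The allocator reads this setting before the first CUDA allocation, so set it at the very top of the script.
    # max_split_size_mb is the knob the OOM message points to for reducing fragmentation;
    # 128 is an arbitrary example value.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch

    # Optional: dump allocator statistics for a given GPU to compare reserved vs. allocated memory.
    if torch.cuda.is_available():
        print(torch.cuda.memory_summary(device=5, abbreviated=True))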

You have 12 GB GPUs, which is small, but then again the model isn't that big. It looks like you aren't using the code, model, or data in this repo, so I'm not sure this is the right place to ask. The typical advice is to lower your batch size; see also the training settings in the README that help lower memory requirements.
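For illustration, a sketch of the usual batch-size / gradient-checkpointing adjustments that advice refers to (values are examples only; the actual settings in this repo's training script and README may differ):

    from transformers import TrainingArguments

    # Hypothetical lower-memory variant of the arguments above; not the repo's configuration.
    training_args = TrainingArguments(
        output_dir="path",
        per_device_train_batch_size=1,   # halve the per-GPU batch...
        gradient_accumulation_steps=4,   # ...and raise accumulation to keep the same effective batch size
        gradient_checkpointing=True,     # recompute activations in the backward pass to cut activation memory
        fp16=True,
        num_train_epochs=1,
        logging_steps=20,
        dataloader_num_workers=2,        # worker count mainly affects host RAM, not GPU memory
    )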