BUG: Neptune v1.3.2 is slow when using forked data loaders in Linux
issamemari opened this issue · 34 comments
Describe the bug
I have a training script that uses the Trainer class from the HuggingFace Transformers library to train a text classification model. The Trainer class uses a NeptuneCallback to log training metrics to Neptune.
I noticed that my training script is inexplicably slow, so I profiled it using py-spy and found that neptune/internal/threading/daemon.py is taking more than 90% of the execution time of my script. I tried disabling the Neptune callback and saw a dramatic improvement in the speed of the training script.
Reproduction
Here's my training code. I have the Neptune client installed, and my project name and API key are set in environment variables, so the Trainer class automatically sends logs to Neptune using the NeptuneCallback class.
import logging

import hydra
from transformers import Trainer, TrainingArguments

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
    )
    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()
Expected behavior
I expect logging to Neptune to not impact the speed of my training script.
Environment
The output of pip list:
Package Version
-------------------------- ---------
accelerate 0.20.3
aiobotocore 2.5.2
aiohttp 3.8.4
aiohttp-retry 2.8.3
aioitertools 0.11.0
aiosignal 1.3.1
amqp 5.1.1
annotated-types 0.5.0
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
appnope 0.1.3
arrow 1.2.3
asn1crypto 1.5.1
asttokens 2.2.1
async-timeout 4.0.2
asyncssh 2.13.2
atpublic 4.0
attrs 23.1.0
backcall 0.2.0
backoff 2.2.1
beautifulsoup4 4.12.2
billiard 4.1.0
boto3 1.26.161
botocore 1.29.161
bravado 11.0.3
bravado-core 5.17.1
celery 5.3.1
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.4
click-didyoumean 0.3.0
click-plugins 1.1.1
click-repl 0.3.0
colorama 0.4.6
comm 0.1.3
configobj 5.0.8
cryptography 40.0.2
debugpy 1.6.7
decorator 5.1.1
dictdiffer 0.9.0
diskcache 5.6.1
distro 1.8.0
dpath 2.1.6
dulwich 0.21.5
dvc 3.5.1
dvc-data 2.5.0
dvc-http 2.30.2
dvc-objects 0.23.0
dvc-render 0.5.3
dvc-s3 2.23.0
dvc-studio-client 0.11.0
dvc-task 0.3.0
executing 1.2.0
filelock 3.12.2
flatten-dict 0.4.2
flufl.lock 7.1.1
fqdn 1.5.1
frozenlist 1.3.3
fsspec 2023.6.0
funcy 2.0
future 0.18.3
gitdb 4.0.10
GitPython 3.1.32
grandalf 0.8
huggingface-hub 0.16.4
hydra-core 1.3.2
idna 3.4
ipykernel 6.24.0
ipython 8.14.0
isoduration 20.11.0
iterative-telemetry 0.0.8
jedi 0.18.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.1
jsonpointer 2.4
jsonref 1.1.0
jsonschema 4.18.0
jsonschema-specifications 2023.6.1
jupyter_client 8.3.0
jupyter_core 5.3.1
kombu 5.3.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mdurl 0.1.2
monotonic 1.6
mpmath 1.3.0
msgpack 1.0.5
multidict 6.0.4
nanotime 0.5.2
neptune 1.3.2
nest-asyncio 1.5.6
networkx 3.1
numpy 1.25.1
oauthlib 3.2.2
omegaconf 2.3.0
orjson 3.9.2
oscrypto 1.3.0
packaging 23.1
pandas 2.0.3
parso 0.8.3
pathspec 0.11.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.0
pip 23.1.2
platformdirs 3.8.1
prompt-toolkit 3.0.39
protobuf 3.19.6
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1
pycparser 2.21
pycryptodomex 3.18.0
pydantic 2.0.2
pydantic_core 2.1.2
pydot 1.4.2
pygit2 1.12.2
Pygments 2.15.1
pygtrie 2.5.0
PyJWT 2.7.0
pyOpenSSL 23.2.0
pyparsing 3.1.0
python-dateutil 2.8.2
pytz 2023.3
PyYAML 6.0
pyzmq 25.1.0
referencing 0.29.1
regex 2023.6.3
requests 2.31.0
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3987 1.3.8
rich 13.4.2
rpds-py 0.8.10
ruamel.yaml 0.17.32
ruamel.yaml.clib 0.2.7
s3fs 2023.6.0
s3transfer 0.6.1
safetensors 0.3.1
scikit-learn 1.3.0
scipy 1.11.1
scmrepo 1.0.4
sentencepiece 0.1.99
setuptools 68.0.0
shortuuid 1.0.11
shtab 1.6.2
simplejson 3.19.1
six 1.16.0
smmap 5.0.0
snowflake-connector-python 3.0.4
sortedcontainers 2.4.0
soupsieve 2.4.1
sqltrie 0.7.0
stack-data 0.6.2
swagger-spec-validator 3.0.3
sympy 1.12
tabulate 0.9.0
threadpoolctl 3.1.0
tokenizers 0.13.3
tomlkit 0.11.8
torch 2.0.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
typing_extensions 4.7.1
tzdata 2023.3
uri-template 1.3.0
urllib3 1.26.16
vine 5.0.0
voluptuous 0.13.1
wcwidth 0.2.6
webcolors 1.13
websocket-client 1.6.1
wheel 0.40.0
wrapt 1.15.0
yarl 1.9.2
zc.lockfile 3.0.post1
The operating system you're using:
Linux
The output of python --version:
Python 3.11.3
Hello @issamemari,
Are you logging model checkpoints to Neptune? Syncing might take a while if large checkpoints are uploaded frequently.
To check whether this is really the case, can you use the callback method of logging with log_checkpoints="best" instead of relying on Transformers' default report_to behavior? This will upload only the best model checkpoint instead of a checkpoint every 500 steps.
Be sure to update report_to to "none" in TrainingArguments() to avoid creating multiple callbacks.
Using the callback method: https://docs.neptune.ai/integrations/transformers/#__tabbed_2_2
NeptuneCallback() API reference: https://docs.neptune.ai/api/integrations/transformers/#neptunecallback
Please let me know if this works better for you.
I've noticed something similar after upgrading Neptune (not sure which version I was using before, but it was from the last month or so), but without using Trainer from transformers (just PyTorch).
Hello @bryant1410 ,
Could you please create a new issue with your case's details so we can have a closer look?
I am not logging model checkpoints to Neptune, I'm just saving them to disk. I tried what you suggested, here's the code:
import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )
    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()
And here's the config file:
model:
  freeze_backbone: false
  checkpoint_path: null

dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv

tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"

training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: null
  load_best_model_at_end: true
Here's the output of my profiler with the Neptune callback:
And here's the output without the Neptune callback (I explicitly called trainer.remove_callback(NeptuneCallback)):
Hey @issamemari ,
I see that you have used report_to: null in your config file. It should be none instead of null. Can you update this and check if the issue is resolved?
Here's the relevant API reference: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.TrainingArguments.report_to
Thanks @SiddhantSadangi, I tried setting report_to to none as you suggested. Here's the code and config:
import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )
    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()
Config:
model:
  freeze_backbone: false
  checkpoint_path: null

dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv

tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"

training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: none
  load_best_model_at_end: true
The training script still seems to be slow. Here's the output from the profiler:
I just recently updated to the newest Neptune version with PyTorch Lightning 2.0.x and torch 1.12.1, and I see extremely slow training. GPU usage doesn't go above 5%. When doing a debug pass with logging disabled, the GPU is used at 100%.
@issamemari , this should not be happening.
Can you share the size of the logged checkpoint? Or, preferably, a link to the logged run so that we can have a look?
If you are uncomfortable sharing it here, you can send it to support@neptune.ai or send it to us through the in-app chat (at the bottom right).
Hello @dscarmo ,
Could you please create a new issue with your case's details so we can have a closer look?
I don't think any checkpoints are being logged, because I always stop the training script before it reaches 3000 steps (it's too slow). When I look at the Artifacts tab, it doesn't show me any artifacts.
Here's the run.
Hello @bryant1410 ,
Could you please create a new issue with your case's details so we can have a closer look?
I don't have time for it, unfortunately. I'm just not using Neptune for now.
Hello @dscarmo , Could you please create a new issue with your case's details so we can have a closer look?
Is there a safe way to remove all cache files and unsynced information without breaking Neptune?
I suspect this is an issue caused by interaction with .neptune files from older versions. I had a very old version of Neptune and had to update due to updating PyTorch Lightning. I had to change from neptune-client to neptune.
If the problem persists after removing .neptune, I will make an issue.
Also, the slowdown happens before any checkpoint logging would happen.
Hello @dscarmo ,
You can use neptune clear to delete all unsynced information in the .neptune folder.
Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful.
Please let me know if this helps.
@issamemari , I cannot reproduce the issue, but I've asked the team to take a look. I'll keep you updated.
@bryant1410 - Is it possible to downgrade to the previous version of Neptune?
Hello @dscarmo ,
You can use neptune clear to delete all unsynced information in the .neptune folder. Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful. Please let me know if this helps.
It seems like I fixed my issue by clearing with neptune clear, uninstalling Neptune, and installing neptune-client 0.16.16.
Note that neptune clear did not fix anything while still using 1.0.
I think others on this thread might fix their issues by using 0.16.16 too.
As additional information, here are screenshots of GPU usage for the exact same debug run (1 step in training and validation) with 1.0 and with 0.16.16:
Clearly something is bottlenecking the main training loop hard; maybe intensive disk operations are interfering with I/O? It might be the same bottleneck the issue OP found in his profiling.
@bryant1410 - Is it possible to downgrade to the previous version of Neptune?
Yes, thanks
Thanks @dscarmo , the charts would be very helpful in debugging what is happening here.
Having to downgrade is definitely not what we would want you to do, so we will get this checked and fixed ASAP!
@bryant1410 - did downgrading fix your issues as well? Which version did you downgrade to?
Thanks @dscarmo , the charts would be very helpful in debugging what is happening here. Having to downgrade is definitely not what we would want you to do, so we will get this checked and fixed ASAP!
@bryant1410 - did downgrading fix your issues as well? Which version did you downgrade to?
Haven't tried but if I do, I'll let you know
Hello @issamemari , @dscarmo , @bryant1410
We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this.
Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?
@issamemari, do these profiling results represent real time or CPU time?
We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this. Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?
Yeah, I can confirm this occurs to me with 1.3.2 but not with 1.3.1.
Thanks for confirming, @bryant1410.
The fix, for now, would be to downgrade to neptune==1.3.1.
I will update the thread once we release a fix.
Hello all,
We are having some trouble reproducing this issue.
Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot.
You can send these to support@neptune.ai.
Hello all,
We are having some trouble reproducing this issue. Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot.
You can send these to support@neptune.ai.
I have sent a Colab reproduction of the issue to this email address.
Thanks @dscarmo.
I'll follow up via email if we need more details.
@SiddhantSadangi I confirm that the issue does not happen when using neptune==1.3.1
Just chiming in, I'm encountering the same issue while trying to train a YOLOv5 network. I'm using this commit, with custom data and a simple neptune.init_run() added to train.py to log the progress.
Training is fine with neptune-client==1.3.1 but slows to a crawl with 1.3.2. The training script uses forked data loaders.
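(A minimal sketch of that kind of init_run integration; the field names below are purely illustrative, not the actual YOLOv5 ones:)

import neptune

run = neptune.init_run()                   # project and API token are read from environment variables
run["parameters/batch_size"] = 16          # illustrative parameter field
for epoch in range(3):
    run["train/loss"].append(1.0 / (epoch + 1))  # append() adds one value to a series
run.stop()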
Is there any workaround for this aside from downgrading? 1.3.2 fixed the optuna integration for me, but this issue massively slows down training.
Hey @wouterzwerink ,
We are actively investigating this, but unfortunately, we cannot offer any workaround aside from downgrading to 1.3.1.
Regarding the optuna issue, was it the StringifyValue has no len() issue? If so, could you try this workaround after downgrading to 1.3.1?
I observe the same problem! On a tiny debugging script we go from 20 sec to 9 min when upgrading from 1.3.1 to 1.3.2 with num_workers > 0. The massive slow-down is also present in 1.4.0rc0. Just instantiating the run, without logging anything, is enough to produce this effect. The problem is also present when disabling most tracking features:
import neptune

run = neptune.init_run(
    source_files=[],
    capture_stdout=False,
    capture_stderr=False,
    capture_traceback=False,
    fail_on_exception=False,
)
Hey all,
We have found a workaround that should work for most cases.
Instead of using fork for multiprocessing, we recommend using spawn or forkserver.
spawn is already the default on macOS and Windows. forkserver is not supported on Windows.
To use spawn/forkserver, you will need to have a single entry point for your training script:
import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    train_fn()  # or trainer.train(), etc.
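For the Hydra-based script shared earlier in this thread, a sketch of the same idea would look roughly like this (assuming the datasets and collator are picklable, which spawn requires; the body of main() is elided):

import multiprocessing

import hydra


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    ...  # build the datasets, model, and Trainer as in the snippets above, then call trainer.train()


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    main()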
Here is a minimal example of how you can do this --> Neptune slowdown issue workaround
Note that using spawn/forkserver comes with some limitations. You can check all the limitations here.
If you do not want to make these changes, you can update the Neptune connection mode to "offline", and then upload the data to Neptune servers using neptune sync once training is complete.
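A rough sketch of the offline option (the variable names are illustrative; the rest of the Trainer setup stays the same):

import neptune
from transformers.integrations import NeptuneCallback

run = neptune.init_run(mode="offline")                       # metadata is written to the local .neptune folder
callback = NeptuneCallback(run=run, log_checkpoints="best")  # pass via callbacks=[callback] to Trainer as before
# once training is complete, upload the locally stored data with the `neptune sync` CLI command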
While we work on a more permanent solution, please let me know if either of these suggestions seems like a viable workaround for now.
Hello all,
We have identified the root cause of this issue and have pushed a fix in v1.4.1rc1.
With this fix, you will not need to resort to any of the above-mentioned workarounds.
We will be publishing the stable version soon, but if you want to test the pre-release version, you can update Neptune by running pip install neptune==1.4.1rc1.
Please let us know if this works.
We just released v1.4.1, which fixes this.
I'll be closing this issue now, but please feel free to let us know if you are still facing issues.