neptune-ai/neptune-client

BUG: Neptune v1.3.2 is slow when using forked data loaders in Linux

issamemari opened this issue ยท 34 comments

Describe the bug

I have a training script that uses the Trainer class from the HuggingFace Transformers library to train a text classification model. The Trainer class uses a NeptuneCallback to log training metrics to Neptune.

I noticed that my training script is inexplicably slow, so I profiled it using py-spy and found that neptune/internal/threading/daemon.py is taking more than 90% of the execution time of my script. I tried disabling the Neptune callback and saw a dramatic improvement in the speed of the training script.

Reproduction

Here's my training code. I have the Neptune client installed and project name and API key in the environment variables so the Trainer class automatically send logs to Neptune using the NeptuneCallback class.

import logging

import hydra
from transformers import Trainer, TrainingArguments

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

Expected behavior

I expect logging to Neptune to not impact the speed of my training script.

Environment

The output of pip list:

Package                    Version
-------------------------- ---------
accelerate                 0.20.3
aiobotocore                2.5.2
aiohttp                    3.8.4
aiohttp-retry              2.8.3
aioitertools               0.11.0
aiosignal                  1.3.1
amqp                       5.1.1
annotated-types            0.5.0
antlr4-python3-runtime     4.9.3
appdirs                    1.4.4
appnope                    0.1.3
arrow                      1.2.3
asn1crypto                 1.5.1
asttokens                  2.2.1
async-timeout              4.0.2
asyncssh                   2.13.2
atpublic                   4.0
attrs                      23.1.0
backcall                   0.2.0
backoff                    2.2.1
beautifulsoup4             4.12.2
billiard                   4.1.0
boto3                      1.26.161
botocore                   1.29.161
bravado                    11.0.3
bravado-core               5.17.1
celery                     5.3.1
certifi                    2023.5.7
cffi                       1.15.1
charset-normalizer         3.2.0
click                      8.1.4
click-didyoumean           0.3.0
click-plugins              1.1.1
click-repl                 0.3.0
colorama                   0.4.6
comm                       0.1.3
configobj                  5.0.8
cryptography               40.0.2
debugpy                    1.6.7
decorator                  5.1.1
dictdiffer                 0.9.0
diskcache                  5.6.1
distro                     1.8.0
dpath                      2.1.6
dulwich                    0.21.5
dvc                        3.5.1
dvc-data                   2.5.0
dvc-http                   2.30.2
dvc-objects                0.23.0
dvc-render                 0.5.3
dvc-s3                     2.23.0
dvc-studio-client          0.11.0
dvc-task                   0.3.0
executing                  1.2.0
filelock                   3.12.2
flatten-dict               0.4.2
flufl.lock                 7.1.1
fqdn                       1.5.1
frozenlist                 1.3.3
fsspec                     2023.6.0
funcy                      2.0
future                     0.18.3
gitdb                      4.0.10
GitPython                  3.1.32
grandalf                   0.8
huggingface-hub            0.16.4
hydra-core                 1.3.2
idna                       3.4
ipykernel                  6.24.0
ipython                    8.14.0
isoduration                20.11.0
iterative-telemetry        0.0.8
jedi                       0.18.2
Jinja2                     3.1.2
jmespath                   1.0.1
joblib                     1.3.1
jsonpointer                2.4
jsonref                    1.1.0
jsonschema                 4.18.0
jsonschema-specifications  2023.6.1
jupyter_client             8.3.0
jupyter_core               5.3.1
kombu                      5.3.1
markdown-it-py             3.0.0
MarkupSafe                 2.1.3
matplotlib-inline          0.1.6
mdurl                      0.1.2
monotonic                  1.6
mpmath                     1.3.0
msgpack                    1.0.5
multidict                  6.0.4
nanotime                   0.5.2
neptune                    1.3.2
nest-asyncio               1.5.6
networkx                   3.1
numpy                      1.25.1
oauthlib                   3.2.2
omegaconf                  2.3.0
orjson                     3.9.2
oscrypto                   1.3.0
packaging                  23.1
pandas                     2.0.3
parso                      0.8.3
pathspec                   0.11.1
pexpect                    4.8.0
pickleshare                0.7.5
Pillow                     10.0.0
pip                        23.1.2
platformdirs               3.8.1
prompt-toolkit             3.0.39
protobuf                   3.19.6
psutil                     5.9.5
ptyprocess                 0.7.0
pure-eval                  0.2.2
pyarrow                    10.0.1
pycparser                  2.21
pycryptodomex              3.18.0
pydantic                   2.0.2
pydantic_core              2.1.2
pydot                      1.4.2
pygit2                     1.12.2
Pygments                   2.15.1
pygtrie                    2.5.0
PyJWT                      2.7.0
pyOpenSSL                  23.2.0
pyparsing                  3.1.0
python-dateutil            2.8.2
pytz                       2023.3
PyYAML                     6.0
pyzmq                      25.1.0
referencing                0.29.1
regex                      2023.6.3
requests                   2.31.0
requests-oauthlib          1.3.1
rfc3339-validator          0.1.4
rfc3987                    1.3.8
rich                       13.4.2
rpds-py                    0.8.10
ruamel.yaml                0.17.32
ruamel.yaml.clib           0.2.7
s3fs                       2023.6.0
s3transfer                 0.6.1
safetensors                0.3.1
scikit-learn               1.3.0
scipy                      1.11.1
scmrepo                    1.0.4
sentencepiece              0.1.99
setuptools                 68.0.0
shortuuid                  1.0.11
shtab                      1.6.2
simplejson                 3.19.1
six                        1.16.0
smmap                      5.0.0
snowflake-connector-python 3.0.4
sortedcontainers           2.4.0
soupsieve                  2.4.1
sqltrie                    0.7.0
stack-data                 0.6.2
swagger-spec-validator     3.0.3
sympy                      1.12
tabulate                   0.9.0
threadpoolctl              3.1.0
tokenizers                 0.13.3
tomlkit                    0.11.8
torch                      2.0.1
tornado                    6.3.2
tqdm                       4.65.0
traitlets                  5.9.0
transformers               4.30.2
typing_extensions          4.7.1
tzdata                     2023.3
uri-template               1.3.0
urllib3                    1.26.16
vine                       5.0.0
voluptuous                 0.13.1
wcwidth                    0.2.6
webcolors                  1.13
websocket-client           1.6.1
wheel                      0.40.0
wrapt                      1.15.0
yarl                       1.9.2
zc.lockfile                3.0.post1

The operating system you're using:
Linux

The output of python --version:
Python 3.11.3

Hello @issamemari ๐Ÿ‘‹

Are you logging model checkpoints to Neptune? Syncing might take a while in case of large checkpoints uploaded frequently.

To check if this is really the case, can you use the callback method of logging with log_checkpoints="best" instead of using Transformer's default report_to? This will upload only the best model checkpoint instead of a checkpoint every 500 steps.

Be sure to update report_to to "none" in TrainingArguments() to avoid creating multiple callbacks.

Using the callback method: https://docs.neptune.ai/integrations/transformers/#__tabbed_2_2
NeptuneCallback() API reference: https://docs.neptune.ai/api/integrations/transformers/#neptunecallback

Please let me know if this works better for you ๐Ÿ™

I've noticed something similar after upgrading Neptune (not sure what version I was using before but it's from the last month or so) but without using Trainer from transformers (just PyTorch).

Hello @bryant1410 ,
Could you please create a new issue with your case's details so we can have a closer look?

I am not logging model checkpoints to Neptune, I'm just saving them to disk. I tried what you suggested, here's the code:

import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

And here's the config file:

model:
  freeze_backbone: false
  checkpoint_path: null


dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv


tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"


training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: null
  load_best_model_at_end: true

Here's the output of my profiler with the Neptune callback:
profile

And here's the output without the Neptune callback (I explicitly called trainer.remove_callback(NeptuneCallback)):
profile

Hey @issamemari ,

I see that you have used report_to: null in your config file. It should be none instead of null. Can you update this as check if the issue is resolved?

Here's the relevant API reference: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer#transformers.TrainingArguments.report_to

Thanks @SiddhantSadangi, I tried setting report_to to none as you suggested. Here's the code and config:

import logging

import hydra
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

from dataset import LukalficationDataset
from dataset.classes import LEVEL1, LEVEL2, LEVEL3
from metrics import MetricsLogger
from model import LukalificationModel
from preprocessing import MultiLabelOneHotEncoder, Processor

logger = logging.getLogger(__name__)


@hydra.main(config_path=".", config_name="config.yaml", version_base="1.2")
def main(config):
    logger.info("Creating dataset")
    train_dataset = LukalficationDataset(**config.dataset.train)
    eval_dataset = LukalficationDataset(**config.dataset.eval)

    logger.info("Creating preprocessor")
    processor = Processor(
        tokenizer_config=config.tokenizer,
        label_encoders=[
            MultiLabelOneHotEncoder(LEVEL1),
            MultiLabelOneHotEncoder(LEVEL2),
            MultiLabelOneHotEncoder(LEVEL3),
        ],
    )

    logger.info("Creating model")
    model = LukalificationModel([len(LEVEL1), len(LEVEL2), len(LEVEL3)], **config.model)

    training_args = TrainingArguments(**config.training)

    logger.info("Training model")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=processor.data_collator,
        compute_metrics=MetricsLogger(
            label_encoders=processor.label_encoders
        ).compute_metrics,
        callbacks=[NeptuneCallback(log_checkpoints="best")],
    )

    trainer.train()

    logger.info("Training complete")


if __name__ == "__main__":
    main()

Config:

model:
  freeze_backbone: false
  checkpoint_path: null


dataset:
  train:
    csv_file: ${hydra:runtime.cwd}/data/train.csv
  eval:
    csv_file: ${hydra:runtime.cwd}/data/eval.csv


tokenizer:
  init:
    pretrained_model_name_or_path: "Unbabel/xlm-roberta-comet-small"
    truncation_side: "left"
  call:
    max_length: 512
    padding: longest
    truncation: true
    return_tensors: "pt"


training:
  output_dir: ${hydra:runtime.cwd}/checkpoints/${now:%Y-%m-%d}/${now:%H-%M}
  num_train_epochs: 50
  per_device_train_batch_size: 48
  per_device_eval_batch_size: 48
  learning_rate: 5e-5
  warmup_ratio: 0.1
  dataloader_num_workers: 8
  lr_scheduler_type: linear
  greater_is_better: true
  metric_for_best_model: "eval_global/f1-score"
  full_determinism: false
  no_cuda: false
  save_strategy: steps
  save_steps: 3000
  evaluation_strategy: steps
  eval_steps: 3000
  logging_strategy: steps
  logging_steps: 100
  optim: adamw_torch
  remove_unused_columns: false
  report_to: none
  load_best_model_at_end: true

The training script still seems to be slow, here's the output from the profiler:
profile

I just recently updated to the newest neptune version with pytorch lightning 2.0.x and torch 1.12.1 and I see extremly slow training. The GPU usage doesn't go above 5%. When doing a debug pass with logging disabled the gpu is used 100%.

@issamemari , this should not be happening ๐Ÿค”

Can you share the size of the logged checkpoint? Or, preferably, a link to the logged run so that we can have a look?
If you are uncomfortable sharing it here, you can send it to support@neptune.ai or send it to us through the in-app chat (at the bottom right).

Hello @dscarmo ,
Could you please create a new issue with your case's details so we can have a closer look?

I don't think any checkpoints are being logged because I'm always turning the training script off before it reaches 3000 steps (it's too slow). When I look at the Artifacts tab it doesn't show me any artifacts.

Here's the run.

Hello @bryant1410 ,
Could you please create a new issue with your case's details so we can have a closer look?

I don't have time for it, unfortunately. I'm just not using Neptune for now.

Hello @dscarmo , Could you please create a new issue with your case's details so we can have a closer look?

Is there a safe way to remove all cache files and uynsynced information without breaking Neptune?

I am suspecting this is a issue of interaction between .neptune files from older versions. I had a very old version of Neptune and had to update due to updating pytorch lightning. I had to change from neptune-client to neptune.

If the problem persists after removing .neptune I will make an issue

Also, the slow down happens before any checkpoint logging would happen.

Hello @dscarmo ,

You can use neptune clear to delete all unsynced information in the .neptune folder.

Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful.

Please let me know if this helps.

@issamemari , I cannot reproduce the issue, but I've asked the team to take a look. I'll keep you updated ๐Ÿ™

@bryant1410 - Is it possible to downgrade to the previous version of Neptune?

Hello @dscarmo ,

You can use neptune clear to delete all unsynced information in the .neptune folder.

Since you also upgraded from neptune-client to neptune, I am sharing an upgrade guide that might be useful.

Please let me know if this helps.

It seems like I fixed my issue by clearing with Neptune clear, uninstalling Neptune, and installing neptune-client 0.16.16.

Note that neptune clear did not fix anything while still using 1.0.

I think others on this thread might fix their issues by using 0.16.16 too.

As additional information, here are screenshots of gpu usage of the same exact debug run (1 step in training and validation) with 1.0 and with 0.16.16:

image
1.0

image

0.16

Clearly something is bottlenecking the main training loop hard, maybe intensive disk operations interfering with IO? Might be the same bottleneck the Issue OP found in his profiling.

@bryant1410 - Is it possible to downgrade to the previous version of Neptune?

Yes, thanks

Thanks @dscarmo , the charts would be very helpful in debugging what is happening here.
Having to downgrade is definitely not what we would want you to do, so we will get this checked and fixed ASAP!

@bryant1410 - did downgrading fix your issues as well? Which version did you downgrade to?

Thanks @dscarmo , the charts would be very helpful in debugging what is happening here. Having to downgrade is definitely not what we would want you to do, so we will get this checked and fixed ASAP!

@bryant1410 - did downgrading fix your issues as well? Which version did you downgrade to?

Haven't tried but if I do, I'll let you know

Hello @issamemari , @dscarmo , @bryant1410

We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this.
Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?

@issamemari
Does these profiling results represent real time or CPU time?

We believe that changes introduced to better support forking processes in neptune==1.3.2 might be responsible for this.
Can you please try downgrading to neptune==1.3.1 and let us know if it solves the issue?

Yeah, I can confirm this occurs to me with 1.3.2 but not with 1.3.1.

Thanks for confirming @bryant1410 ๐Ÿ™

The fix, for now, would be to downgrade to neptune==1.3.1.
I will update the thread once we release a fix.

Hello all,

We are having some trouble reproducing this issue.
Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot ๐Ÿ™

You can send these to support@neptune.ai.

Hello all,

We are having some trouble reproducing this issue. Would it be possible for anyone to share a reproducible example? Code, data sample, hardware details, etc. would help a lot pray

You can send these to support@neptune.ai.

I have sent this e-mail a reproduction of the issue in Colab.

Thanks @dscarmo ๐Ÿ™

I'll followup via mail if we need more details.

@SiddhantSadangi I confirm that the issue does not happen when using neptune==1.3.1

Just chiming in, I'm encountering the same issue while trying to train a YOLOv5 network. I'm using this commit, with custom data and a simple neptune.init_run() added to train.py to log the progress.

Training is fine with neptune-client==1.3.1 but slows to a crawl with 1.3.2. The training script uses forked data loaders.

Is there any workaround for this aside from downgrading? 1.3.2 fixed the optuna integration for me, but this issue causes it to massively slow down training

Hey @wouterzwerink ,
We are actively investigating this, but unfortunately, we cannot offer any workaround aside from downgrading to 1.3.1 ๐Ÿ˜ž

Regarding the optuna issue, was it the StringifyValue has no len() issue?
If so, could you try this workaround after downgrading to 1.3.1?

I observe the same problem! On a tiny debugging script we go from 20sec to 9min when upgrading from 1.3.1 to 1.3.2 with num_workers>0. The massive slow-down is also present in 1.4.0rc0. Just instantiating the run, without logging anything, is enough to produce this effect. The problem is also present when disabling most tracking features.

            source_files=[],
            capture_stdout=False,
            capture_stderr=False,
            capture_traceback=False,
            fail_on_exception=False,

Hey all,
We have found a workaround that should work for most cases.

Instead of using fork for multiprocessing, we recommend using spawn or forkserver.
spawn is already the default in MacOS and Windows. forkserver is not supported on Windows.

To use spawn/forkserver, you will need to have a single entry point for your training script:

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    train_fn() # or trainer.train(), etc

Here is a minimal example of how you can do this --> Neptune slowdown issue workaround

Note that using spawn/forkserver comes with some limitations. You can check all the limitations here

If you do not want to make these changes, you can update the Neptune connection mode to "offline", and then upload the data to Neptune servers using neptune sync once training is complete.

While we work on a more permanent solution, please let me know if either of these suggestions seem viable workarounds for now ๐Ÿ™

Hello all ๐Ÿ‘‹

We have identified the root cause of this issue and have pushed a fix v1.4.1rc1 ๐ŸŽ‰

With this fix, you will not need to resort to any of the above-mentioned workarounds. โœ…

We will be publishing the stable version soon, but if you want to test the pre-release version, you can update neptune by running pip install neptune==1.4.1rc1.

Please let us know if this works ๐Ÿ™

We just released v1.4.1 which fixes this ๐Ÿฅณ

I'll be closing this issue now, but please feel free to let us know if you are still facing issues.