BUG: Error msg during training - Timestamp must be non-decreasing for series attribute
Opened this issue · 22 comments
Describe the bug
When running Neptune logger in PytorchLightning with ddp > 1gpus.
Then there are continuous errors reading Error occured during asynchronous operation processing. Timestamp must be non-decreasing for series attribute
If the Neptune logger is offline, or if neptune logger is removed then this error isn't logged.
There are too many errors, and even the progress bar of the training is difficult to identify.
Reproduction
When i run with 4gpus, i was able to reproduce this.
https://colab.research.google.com/drive/1TOadmpet63eSXz6LMHVvdM-D6Gy0LDxe?usp=sharing
Expected behavior
If this is a valid error message, then there is no hint of what actions needs to be taken. If they are harmless/not valid kindly suggest a way to suppress this print.
Traceback
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Environment
The output of pip list
:
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31
Python version: 3.8.11 (default, Aug 3 2021, 15:09:35) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: RTX A6000
GPU 1: RTX A6000
GPU 2: RTX A6000
GPU 3: RTX A6000
GPU 4: RTX A6000
GPU 5: RTX A6000
GPU 6: RTX A6000
GPU 7: RTX A6000
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] neptune-pytorch-lightning==0.9.7
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.9.0+cu111
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==0.9.0
[pip3] torchmetrics==0.4.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] mypy 0.910 pypi_0 pypi
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] neptune-client 0.12.0 pypi_0 pypi
[conda] neptune-contrib 0.27.3 pypi_0 pypi
[conda] neptune-pytorch-lightning 0.9.7 pypi_0 pypi
[conda] numpy 1.21.1 pypi_0 pypi
[conda] numpy-base 1.21.2 py38h79a1101_0
[conda] pytorch-lightning 1.4.9 pypi_0 pypi
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-poly-lr-decay 0.0.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchmetrics 0.4.1 pypi_0 pypi
Additional context
Hey @stonelazy,
Thanks for this report.
You have more than 1 process writing in async way to the same field in neptune (monitoring/gpu_3
for example). In Neptune, for given field, we expect the step
to be strictly increasing (x, x+1, x+1.0001, x+4, x+5, ...). When it's not true you see an error that you reported here. As a consequence this point is dropped and your run progress uninterrupted.
In common use like run["x"].log(2)
, Neptune takes consecutive integers staring at 0. This is what you observe in your loss charts on x axis.
So what is happening in your case is the following (all processes log to the same run, async):
process A: run["x"].log(value=7, step=1) (ok)
process B: run["x"].log(value=3, step=2) (ok)
process A: run["x"].log(value=1, step=2) (error)
process A: run["x"].log(value=4, step=3) (ok)
process B: run["x"].log(value=0, step=3) (error)
process A: run["x"].log(value=9, step=4) (ok)
In it not an error that will crash your run, at the same time, I see that it can be annoying.
Quick workaround is to disable stdout/err and hardware monitoring - as this is what I see gives you the error msgs (docs).
neptune.init(
capture_stdout=False,
capture_stderr=False,
capture_hardware_metrics=False,
...,
)
You will not get these error msgs at the expense of less granular view into your runs.
At Neptune we see this use case, and we are working on better handling of such needs.
Wanted to ask if you would be open to have a short online meeting with me -> I'm preparing some best practices for our users and - at the same time - I'm passing insights to the product team. I see that your insights/needs would help us a lot in making it right! https://calendly.com/kamil-neptune/30min
Thanks for the details. Yeah, i had it figured on disabling those 3 flags i didn't get the error and have been running my training with that for few days now. But the workaround comes at a dataloss, no longer we would be able to monitor CPU/GPU metrics. This workaround doesn't seem to be convincing.
And continuous error messages are totally annoying. I guess this issue feels like a showstopper for me.
I appreciate your feedback. I will get back here with more details soon.
For now I have some temporary workaround. You can try to suppress these errors like this:
import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)
Not sure if it will work for GPUs as well, but it's worth trying. At the same time please use it wisely and as temporary workaround as it disables all errors from async processing.
I already passed info about your case to the product team for further consideration.
Let me know what you think?
Hi @stonelazy,
I'm thinking what is your current status / thinking when it comes to the above issues?
Hi @kamil-kaczmarek am now not having any issues with the suggested workaround.
@stonelazy ok good,
I recommend to transition to the PTL 1.5 and use our new logger:
- documentation
- PyTorch Lightning docs about Neptune Logger
If we move to PTL 1.5, we won't have this error ?
@stonelazy unfortunately not, I'm sorry.
Referred to the new logger, due to improvements in the logger in other dimensions, especially it uses our new API under the hood.
For now I have some temporary workaround. You can try to suppress these errors like this:
import logging logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)
For people stumbling on this issue: you can also "cleanly" silence neptune logs in kedro using the logging.yml
configuration file. To do so, you can add those lines in the loggers
section of the file in conf/base/logging.yml
:
loggers:
# [ OTHER LOGGERS CONFIGURATION ...]
neptune.new.internal.operation_processors.async_operation_processor:
level: CRITICAL
handlers: [console, info_file_handler, error_file_handler]
propagate: no
Hi @stonelazy,
Prince Canuma here, a Data Scientist at Neptune.ai,
I want to personally inform you of the good news! This issue is now fixed on the latest release of PyTorch-Lightning v1.5.7 🎊 🥳
All you need to do is upgrade the library to the latest release👍
Happy Christmas and a prosperous New Year in advance!
I'm still seeing this error with pytorch-lightning==1.9.1
and neptune-client==0.16.17
.
Could you share a minimal reproducible code and the setup you have(i.e., no of GPUs and strategy) ?
The issue is still present. Just logging the hardware metrics with more than 1 process results in a lot of spam. I don't know why this issue is closed. Please reopen it.
Hi @mrembalski
Bummer that you're experiencing some issues. No worries, we're here to help!
I have a couple of questions that'll help me understand the situation better:
- Are you using Pytorch Lightning or just pure Pytorch?
- Could you share a minimal reproducible code snippet and your current setup (like the number of GPUs and the strategy you're using)?
Once I have this info, we'll dive right in and get things sorted out for you!
By the way, if you are using PyTorch we have since released a guide on our docs on how to track DDP jobs. Read more here:
https://docs.neptune.ai/tutorials/running_distributed_training/
I'm using Pytorch Lightning. I'll try to create a snippet tomorrow.
Perfect!
I am looking forward to it :)
Hi @Blaizzy I ham trying to use Neptune logger in pure Pytorch through Fairseq i.e trying to integrate Neptune logger for Fairseq package which already supports lots of your comptetitors like, WANDB, AzureML, Aim etc.
I am also facing the same issue. I have run the job only on one GPU rank and still getting the logs as stated above(reproduced below).
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.659Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.985Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.987Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.370Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.371Z
Where is this exactly an issue ? As in, is there someway I can make changes to prevent this issue from occuring, some kind of sync/wait primitives available in neptune that I am not calling ? Any direction would help me better handle the logger implementation for Fairseq.
For now I have some temporary workaround. You can try to suppress these errors like this:
import logging logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)
Not sure if it will work for GPUs as well, but it's worth trying. At the same time please use it wisely and as temporary workaround as it disables all errors from async processing.
I already passed info about your case to the product team for further consideration.
Let me know what you think?
Hey, the import changed (the "new" is gone from the path). Here is how I ignore these messages for now:
logging.getLogger("neptune.internal.operation_processors.async_operation_processor").setLevel(
logging.CRITICAL
)
This issue is still very much not resolved, despite having been marked as resolved. Our logs are absolutely illegible, because this message appears millions and millions of times:
Note that this is training on a single machine, so this cannot be caused by multiple loggers with slightly out-of-sync clocks. As far as I can tell, Neptune is actually trying to log messages with identical timestamps as keys, and this is causing this error. The following 3 messages, all have the same timestamp up to a millisecond precision. Unless Neptune uses microsecond precision for the logs, I think this must be the issue:
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
-- | --
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
On a different note, a better way to filter out these spurious error messages without affecting actual legitimate error logs is to do something like:
class _FilterCallback(logging.Filterer):
@override
def filter(self, record: logging.LogRecord):
return not (
record.name == "neptune"
and record.getMessage().startswith(
"Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute"
)
)
...
neptune.internal.operation_processors.async_operation_processor.logger.addFilter(_FilterCallback())
Hey @danieldanciu 👋
Thanks for pointing this out. While not a bug (more of a design issue), it should not have been marked as resolved.
We are working on an updated client version that will address this issue. Meanwhile, the workaround you've suggested to filter out such logs works like a charm, and we'll add it to our docs ✅