uzh-rpg/RVT

The error when using wandb

Closed this issue · 11 comments

Hatins commented

Hi @magehrig
I have run into a new problem when using wandb, which seems to be caused by a network error. The traceback includes the following:

urllib3.exceptions.ProxyError: ('Unable to connect to proxy', ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f618c122fd0>, 'Connection to 115.156.95.129 timed out. (connect timeout=9)'))

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ProxyError('Unable to connect to proxy', ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f618c122fd0>, 'Connection to 115.156.95.129 timed out. (connect timeout=9)')))

requests.exceptions.ProxyError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ProxyError('Unable to connect to proxy', ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f618c122fd0>, 'Connection to 115.156.95.129 timed out. (connect timeout=9)')))

I ran your code successfully a long time ago; this error has only appeared recently.
I know the problem comes from the network, but I was wondering whether you have a solution for it?

Best!
Haitins

Hi @Hatins
This is because the logger uses the online wandb service, which relies on the wandb cloud API being able to reach their servers. Apparently it is not always stable, so issues like this occur.

A workaround is to use the default wandb logger from PyTorch Lightning, or any other logger. Or just wait until the issue disappears (which is what I usually do).
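
The traceback shows that the connection that timed out was to a proxy at 115.156.95.129, not to api.wandb.ai directly, so it may be worth checking which proxy settings the training process inherits. A minimal sketch using only the standard library follows; the helper name `describe_proxies` is hypothetical and not part of RVT or wandb.

```python
import urllib.request


def describe_proxies() -> dict:
    """Return the proxy settings that Python's standard library detects.

    On Linux these typically come from environment variables such as
    HTTP_PROXY / HTTPS_PROXY (upper- or lower-case).
    """
    return urllib.request.getproxies()


if __name__ == "__main__":
    proxies = describe_proxies()
    if proxies:
        for scheme, url in sorted(proxies.items()):
            print(f"{scheme}: {url}")
    else:
        print("No proxy settings detected.")
```

If a stale proxy shows up here, correcting or unsetting the corresponding environment variable before launching training is one thing to try. Whether that helps depends on the local network setup.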

Hatins commented

Hi @magehrig
Thank you for answering so quickly. As you said, it is a really annoying problem: I have had this bug for three days now, and every time it interrupts the run partway through. If I use wandb's offline mode and upload the data when the network is good, will the result be the same?

By the way, if you know how to enable offline mode, please tell me, since I don't know which way is the most suitable (there are too many different ways described on the Internet to set wandb to offline...).

For now I did it like this:

        self._wandb_init = dict(
            name=name,
            project=project,
            group=group,
            id=wandb_id,
            resume="allow",
            save_code=True,
            mode='offline',
        )

However, I then get this error:

  File "/home/zht/python_project/RVT_OWOD_v1/loggers/wandb_logger.py", line 236, in _num_logged_artifact
    public_run = self._get_public_run()
  File "/home/zht/python_project/RVT_OWOD_v1/loggers/wandb_logger.py", line 230, in _get_public_run
    runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Best!
Haitins
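
The `TypeError` in this traceback arises because `experiment._entity` is `None` when it is concatenated with a string. Presumably the entity is left unset in offline mode because no server is contacted; that is an assumption and has not been checked against the wandb source. A minimal sketch of a guard follows, with a hypothetical helper `build_run_path` that is not part of the RVT code.

```python
from typing import Optional


def build_run_path(entity: Optional[str],
                   project: Optional[str],
                   run_id: Optional[str]) -> Optional[str]:
    """Join entity/project/run_id, or return None if any part is missing."""
    parts = (entity, project, run_id)
    if any(part is None for part in parts):
        # Nothing sensible to look up, e.g. when the run exists only locally.
        return None
    return "/".join(parts)
```

A caller would then have to skip the public-run lookup whenever this returns `None`. How the rest of the custom logger should behave in that case is not covered here.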

You can swap the logger I wrote for any other logger that PyTorch Lightning provides. You have to replace this line to initialize the new logger.

Hatins commented

Hi @magehrig
Thanks for your advice! I understand what you mean, and I changed the line you mentioned to

logger = WandbLogger(project="RVT_OWOD",group="version_1")

However, I still get an error:
[image: screenshot of the error]
So maybe it still needs some revisions to fit the style of your code, but I don't know what to modify, so I may need your help. I was also wondering whether, if it's not too much trouble, you could make some adjustments to your code to achieve full functionality.

You can remove these asserts (those that enforce that type(logger) is WandbLogger) and try again. In general, I did not write the code with other loggers in mind, so you will have to improvise slightly here. Let me know how it goes.

Hatins commented

I got it. I followed the instructions and removed these asserts, but then I got an error:

 File "/home/zht/Python_project/RVT_OWOD_v1/callbacks/detection.py", line 98, in on_validation_epoch_end_custom
   logger.log_images(key='val/predictions',
AttributeError: 'WandbLogger' object has no attribute 'log_images'

So I commented out that call as well:

        # logger.log_images(key='val/predictions',
        #                   images=merged_img,
        #                   caption=captions)

Now the code seems to be running smoothly, but I will need some time to verify that no further errors appear. I would also like to know whether removing this piece of code has any negative impact.

Best!
Hatins

The obvious consequence is that you are not logging the "merged_img" anymore. That should be fine if you don't want it to be logged. Because you exchanged the logger, you will also have to adapt the code slightly to reload from the checkpoint and resume the training.
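
If image logging is still wanted with the stock logger, one option is to dispatch on whichever method the logger offers. The sketch below assumes that recent PyTorch Lightning versions expose `WandbLogger.log_image` (singular) accepting `key`, `images`, an optional `step`, and extra keyword arguments such as `caption`. The helper name `log_images_compat` is hypothetical, and the exact shape expected for `images` and `caption` should be checked against the installed Lightning version (it may need lists of equal length).

```python
from typing import Any, Optional


def log_images_compat(logger: Any,
                      key: str,
                      images: Any,
                      captions: Optional[Any] = None,
                      step: Optional[int] = None) -> bool:
    """Log images if the logger supports it; return True if a call was made."""
    # Try the custom logger's method name first, then the stock one.
    for method_name in ("log_images", "log_image"):
        method = getattr(logger, method_name, None)
        if callable(method):
            kwargs = {"key": key, "images": images}
            if captions is not None:
                kwargs["caption"] = captions
            if step is not None:
                kwargs["step"] = step
            method(**kwargs)
            return True
    # Logger has no image support: skip silently instead of raising.
    return False
```

With a helper like this, the call in callbacks/detection.py could stay in place and simply do nothing for loggers without image support.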

Hatins commented

Hi @magehrig
I wanted to let you know that I've understood your guidance. As a result, the code is now running smoothly, and if needed, I am fully prepared to make any necessary revisions myself!

I extend my heartfelt appreciation for your invaluable assistance once more! Wishing you happiness every day!

I'm happy for your success, Hatins. Unfortunately, I have not managed to reproduce your solution.
Could you document the changes needed for an offline run here? It is quite time-consuming to work them out. I assume there are not many changes, and it would be very nice for future users if you did.
This open issue might be the best place for it, to give back what you got.

Thank you, magehrig, for sharing your great work.

Hatins commented

@vanAken
Hi vanAken, don't worry, I'd like to help!
As the first step, import the default WandbLogger from pytorch_lightning:

from pytorch_lightning.loggers import WandbLogger

To make sure wandb can identify your account, set the related environment variables at the beginning:

os.environ["WANDB_API_KEY"] = 'xxxxxxxxxxxxxxxxx'
os.environ["WANDB_MODE"] = "offline"

Then replace the line:

# logger = get_wandb_logger(config)

with

logger = WandbLogger(project=config.wandb.project_name, name='xxx', group=config.wandb.group_name)

Note that these steps should be done in train.py.

After that, comment out the visualization code written by @magehrig in callbacks/detection.py (lines 65-68 and lines 98-100):

        # logger.log_images(key='train/predictions',
        #                   images=merged_img,
        #                   caption=captions,
        #                   step=global_step)
        # logger.log_images(key='val/predictions',
        #                   images=merged_img,
        #                   caption=captions)

After that, you can run the code in offline mode; however, the visualization will no longer be available.
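
The train.py changes described in this comment can be gathered into one place. This is a sketch under assumptions, not the RVT authors' implementation: the function names `configure_offline_wandb` and `build_offline_logger` are hypothetical, and it relies on PyTorch Lightning's `WandbLogger` accepting `offline=True` and forwarding `group` to wandb.

```python
import os


def configure_offline_wandb() -> None:
    """Ask wandb to write runs locally instead of contacting the server."""
    os.environ["WANDB_MODE"] = "offline"


def build_offline_logger(project: str, group: str, name: str):
    """Create PyTorch Lightning's stock WandbLogger in offline mode."""
    # Imported lazily so this module loads even without Lightning installed.
    from pytorch_lightning.loggers import WandbLogger

    configure_offline_wandb()
    # `offline=True` is the logger's own switch; `group` is passed on to wandb.
    return WandbLogger(project=project, group=group, name=name, offline=True)
```

In train.py this would be called as `logger = build_offline_logger(config.wandb.project_name, config.wandb.group_name, 'xxx')`. Runs recorded offline are stored locally and can usually be uploaded afterwards with the `wandb sync` command once the network is reachable.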

Thanks for your help, Hatins.
Now it's training. After 20 h it has reached 100,000 iterations and the first epoch isn't finished yet.

As mentioned above, the asserts need to be removed in two places in callbacks/viz_base.py.
Comment out line 91 and line 155:

#assert isinstance(logger, WandbLogger)

Thanks to Hatins. You are doing a great job at UZH.