huggingface/deep-rl-class

[HANDS-ON BUG] Unable to push model to the hub, endless loop. Unit 3

MaelSeed opened this issue · 4 comments

Describe the bug

Hi everyone, I ran into a slight hiccup when trying to push my model to the Hugging Face Hub with rl_zoo3. It runs without errors, but at the printout:
"ℹ Pushing repo dqn-SpaceInvadersNoFrameskip-v4 to the Hugging Face Hub"
the cell just keeps running without any further progress. I ran it yesterday without a GPU because I had already done the training, and faced the same problem. I thought it was because of the missing GPU, so today I tried again with one, but I am still stuck. On the Hugging Face models page, the new model shows up, but it's basically empty: "No model card". I used a token with write access and got confirmation. The still-running cell shows 2 warnings, one for "No render fps was declared" and another for "Cloning into local empty directory", but I am not sure if they are related to the problem.
I let it run for the whole 3 hours before Colab disconnected the instance; it was still stuck at the same line.

I tried deleting my model on Hugging Face and in Google Drive under /hub to get a clean start, but it did not help.
I triple-checked for an error on my end, but I can't find any (which does not mean that there is none, of course).

This is the complete output of the cell, it just keeps running in an endless loop afterwards without any more progress:

2023-06-07 08:36:02.719472: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-07 08:36:04.181424: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading latest experiment, id=2
Loading logs/dqn/SpaceInvadersNoFrameskip-v4_2/SpaceInvadersNoFrameskip-v4.zip
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Stacking 4 frames
Wrapping the env in a VecTransposeImage.
Uploading to Gamayun/dqn-SpaceInvadersNoFrameskip-v4, make sure to have the rights
ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to some
minutes if video generation is activated. This is a work in progress: if you
encounter a bug, please open an issue.
Cloning https://huggingface.co/Gamayun/dqn-SpaceInvadersNoFrameskip-v4 into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/Gamayun/dqn-SpaceInvadersNoFrameskip-v4 into local empty directory.
Saving model to: hub/dqn-SpaceInvadersNoFrameskip-v4/dqn-SpaceInvadersNoFrameskip-v4
/usr/local/lib/python3.10/dist-packages/gymnasium/utils/passive_env_checker.py:364: UserWarning: WARN: No render fps was declared in the environment (env.metadata['render_fps'] is None or not defined), rendering may occur at inconsistent fps.
  logger.warn(
Saving video to /tmp/tmpxj2niqs1/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpxj2niqs1/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpxj2niqs1/-step-0-to-step-1000.mp4

Moviepy - Done !
Moviepy - video ready /tmp/tmpxj2niqs1/-step-0-to-step-1000.mp4
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/tmp/tmpxj2niqs1/-step-0-to-step-1000.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.29.100
  Duration: 00:00:33.40, start: 0.000000, bitrate: 67 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 160x210, 65 kb/s, 30 fps, 30 tbr, 15360 tbn, 60 tbc (default)
    Metadata:
      handler_name    : VideoHandler
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))
Press [q] to stop, [?] for help
[libx264 @ 0x55e2841b4480] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
[libx264 @ 0x55e2841b4480] profile High, level 1.2
[libx264 @ 0x55e2841b4480] 264 - core 155 r2917 0a84d98 - H.264/MPEG-4 AVC codec - Copyleft 2003-2018 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=3 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'hub/dqn-SpaceInvadersNoFrameskip-v4/replay.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.29.100
    Stream #0:0(und): Video: h264 (libx264) (avc1 / 0x31637661), yuv420p, 160x210, q=-1--1, 30 fps, 15360 tbn, 30 tbc (default)
    Metadata:
      handler_name    : VideoHandler
      encoder         : Lavc58.54.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
frame= 1002 fps=0.0 q=-1.0 Lsize=     270kB time=00:00:33.30 bitrate=  66.4kbits/s speed=  38x    
video:259kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 4.141588%
[libx264 @ 0x55e2841b4480] frame I:5     Avg QP:16.64  size:  3773
[libx264 @ 0x55e2841b4480] frame P:508   Avg QP:22.42  size:   433
[libx264 @ 0x55e2841b4480] frame B:489   Avg QP:28.50  size:    53
[libx264 @ 0x55e2841b4480] consecutive B-frames: 28.9% 14.6% 10.2% 46.3%
[libx264 @ 0x55e2841b4480] mb I  I16..4: 31.4% 24.1% 44.4%
[libx264 @ 0x55e2841b4480] mb P  I16..4:  0.5%  0.9%  1.0%  P16..4:  8.1%  3.5%  2.2%  0.0%  0.0%    skip:83.8%
[libx264 @ 0x55e2841b4480] mb B  I16..4:  0.1%  0.1%  0.1%  B16..8:  9.7%  0.9%  0.1%  direct: 0.2%  skip:88.9%  L0:48.2% L1:51.3% BI: 0.5%
[libx264 @ 0x55e2841b4480] 8x8 transform intra:34.7% inter:4.3%
[libx264 @ 0x55e2841b4480] coded y,uvDC,uvAC intra: 27.8% 47.7% 44.5% inter: 1.8% 2.6% 2.2%
[libx264 @ 0x55e2841b4480] i16 v,h,dc,p: 56% 36%  8%  0%
[libx264 @ 0x55e2841b4480] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 21%  8% 68%  2%  0%  0%  0%  0%  0%
[libx264 @ 0x55e2841b4480] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 34% 12% 40%  2%  2%  3%  2%  3%  1%
[libx264 @ 0x55e2841b4480] i8c dc,h,v,p: 54% 26% 19%  2%
[libx264 @ 0x55e2841b4480] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x55e2841b4480] ref P L0: 79.2%  6.2%  8.8%  5.8%
[libx264 @ 0x55e2841b4480] ref B L0: 86.1% 11.4%  2.4%
[libx264 @ 0x55e2841b4480] ref B L1: 95.9%  4.1%
[libx264 @ 0x55e2841b4480] kb/s:63.43
ℹ Pushing repo dqn-SpaceInvadersNoFrameskip-v4 to the Hugging Face
Hub

Link to the notebook:
https://colab.research.google.com/drive/1jR6tkmD-k-zG9aMsM9nqkwDvIvuxjHGs?usp=sharing

Material

I used Colab with a remote instance, with and without GPU; same error.

I really appreciate the support, thanks!

Hey there 👋, I just tried running it again and it works. Can you run the push-to-hub part with --no-graphics to see if it's a video-related problem? 🤔

Did you solve this issue?

Thanks for checking in on my issue again, I appreciate it! It turns out that the cells I added to the notebook, where I log into Google Drive and change the notebook's working directory to the one where dqn.yml is located, seem to cause problems with .push_to_hub(), even though all the other code in between runs without issues, including the training. I found this out when, as per your advice, I went back to the original notebook and ran it without those cells.
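For anyone who still needs files from another folder, one possible workaround is to scope the directory change so the notebook returns to its original working directory before the push. This is only a minimal sketch: `working_directory` is a hypothetical helper, the temporary folder stands in for the Drive folder holding dqn.yml, and it has not been confirmed to avoid the hang.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the process working directory, then restore it."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Always go back, even if the body raised an exception.
        os.chdir(previous)


# Stand-in for the Drive folder that holds dqn.yml (hypothetical path).
config_dir = tempfile.mkdtemp()

start = os.getcwd()
with working_directory(config_dir):
    # Steps that need the config folder would run here.
    inside = os.getcwd()

# Back in the original directory before any push to the Hub.
restored = os.getcwd()
```

The point of the design is that the directory change cannot leak into later cells, so the push runs from the same location as in the unmodified notebook.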

However, without an exception I was unable to investigate the issue in more detail. I assume there is some problem in the threaded code in .push_to_hub() that does not work well with Google Drive and/or with changing the notebook's working directory.
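When a call hangs without raising anything, Python's standard `faulthandler` module can dump the stack of every thread after a timeout, which shows where the code is stuck. The sketch below is an illustration under assumptions: `time.sleep` stands in for the hanging call, the log path is hypothetical, and the dump only covers the Python process in which it is enabled, so a push started as a separate command would need this set up inside that process.

```python
import faulthandler
import os
import tempfile
import time

# Write the dump to a real file; notebook output streams may not
# expose the file descriptor that faulthandler needs.
dump_path = os.path.join(tempfile.mkdtemp(), "hang_traceback.log")
dump_file = open(dump_path, "w")

# If the code below is still running after `timeout` seconds,
# the tracebacks of all threads are written to dump_file.
faulthandler.dump_traceback_later(timeout=1, file=dump_file)
try:
    time.sleep(2)  # stand-in for a call that hangs
finally:
    # Stop the watchdog and release the file once the call returns.
    faulthandler.cancel_dump_traceback_later()
    dump_file.close()

with open(dump_path) as f:
    report = f.read()
print(report)
```

For a real hang, a longer timeout (and `repeat=True`) would give periodic snapshots that can be attached to a bug report.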

Do you think it would be a good idea to open an issue on the RL-Baselines3-Zoo GitHub page? Or do you have any other advice on what I should do next?

I have since tried to use the Google Colab Docker image on my Windows 10 machine with WSL, but I ran into a problem mounting my system drive into the Docker instance, so any progress there would not be permanent (apart from anything I upload to Hugging Face, of course). In the end, I decided to install Ubuntu via WSL on my Windows 10 machine, installed Python and the dependencies I needed, and created a venv in which I can now run Unit 3 without any issues.

Again, thanks for getting back to me, I really appreciate it!

Hey there,

Happy to hear it works now.

As for the issue, I'm not sure it's related to RL Zoo rather than Colab 🤔. You can still open an issue on RL Zoo, but I'm not sure they'll have a solution.