Broken pipe caused by ZeroDivisionError
DEQDON opened this issue · 3 comments
DEQDON commented
Hi, I ran into a ZeroDivisionError which then causes a BrokenPipeError. The command I ran was python main.py -n1 --auto_gpu_config 0 --split val
At first, everything was fine and I could see the training progress with losses printed. But after some time it crashed; the error log is below (the timestamps are inconsistent because I combined two runs into one full log history):
Dumping at ./tmp//models/exp1/
Namespace(alpha=0.99, auto_gpu_config=0, camera_height=1.25, clip_param=0.2, collision_threshold=0.2, cuda=True, du_scale=2, dump_location='./tmp/', entropy_coef=0.001, env_frame_height=256, env_frame_width=256, eps=1e-05, eval=0, exp_loss_coeff=1.0, exp_name='exp1', frame_height=128, frame_width=128, gamma=0.99, global_downscaling=2, global_hidden_size=256, global_lr=2.5e-05, goals_size=2, hfov=90.0, load_global='0', load_local='0', load_slam='0', local_hidden_size=512, local_optimizer='adam,lr=0.0001', local_policy_update_freq=5, log_interval=10, map_pred_threshold=0.5, map_resolution=5, map_size_cm=2400, max_episode_length=1000, max_grad_norm=0.5, no_cuda=False, noise_level=1.0, noisy_actions=1, noisy_odometry=1, num_episodes=1000000, num_global_steps=40, num_local_steps=25, num_mini_batch=0, num_processes=1, num_processes_on_first_gpu=0, num_processes_per_gpu=11, obs_threshold=1, obstacle_boundary=5, pose_loss_coeff=10000.0, ppo_epoch=4, pretrained_resnet=1, print_images=0, proj_loss_coeff=1.0, randomize_env_every=1000, save_interval=1, save_periodic=500000, save_trajectory_data='0', seed=1, short_goal_dist=1, sim_gpu_id=0, slam_batch_size=72, slam_iterations=10, slam_memory_size=500000, slam_optimizer='adam,lr=0.0001', split='val', task_config='tasks/pointnav_gibson.yaml', tau=0.95, total_num_scenes='auto', train_global=1, train_local=1, train_slam=1, use_deterministic_local=0, use_gae=False, use_pose_estimation=2, use_recurrent_global=0, use_recurrent_local=1, value_loss_coef=0.5, vis_type=1, vision_range=64, visualize=0)
Loading data/scene_datasets/gibson/Cantwell.glb
2021-03-24 01:48:47,936 initializing sim Sim-v0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0324 01:48:47.955277 25783 WindowlessContext.cpp:98] [EGL] Detected 6 EGL devices
I0324 01:48:47.955494 25783 WindowlessContext.cpp:119] [EGL] Selected EGL device 1 for CUDA device 0
I0324 01:48:47.958079 25783 WindowlessContext.cpp:133] [EGL] Version: 1.5
I0324 01:48:47.958138 25783 WindowlessContext.cpp:134] [EGL] Vendor: NVIDIA
Renderer: TITAN V/PCIe/SSE2 by NVIDIA Corporation
OpenGL version: 4.6.0 NVIDIA 455.45.01
Using optional features:
GL_ARB_ES2_compatibility
GL_ARB_direct_state_access
GL_ARB_get_texture_sub_image
GL_ARB_invalidate_subdata
GL_ARB_multi_bind
GL_ARB_robustness
GL_ARB_separate_shader_objects
GL_ARB_texture_filter_anisotropic
GL_ARB_texture_storage
GL_ARB_texture_storage_multisample
GL_ARB_vertex_array_object
GL_KHR_debug
Using driver workarounds:
nv-egl-incorrect-gl11-function-pointers
no-layout-qualifiers-on-old-glsl
nv-zero-context-profile-mask
nv-implementation-color-read-format-dsa-broken
nv-cubemap-inconsistent-compressed-image-size
nv-cubemap-broken-full-compressed-image-query
nv-compressed-block-size-in-bits
I0324 01:48:53.312220 25783 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Cantwell.navmesh
2021-03-24 01:48:53,313 initializing task Nav-v0
2021-03-24 01:48:53,324 Computing map for data/scene_datasets/gibson/Cantwell.glb
Time: 00d 00h 00m 00s, num timesteps 0, FPS 0,
Rewards:
Losses:
Time: 00d 00h 00m 01s, num timesteps 10, FPS 8,
Rewards:
Losses: Local Loss: 5.300,
Time: 00d 00h 00m 05s, num timesteps 20, FPS 3,
Rewards:
Losses: Local Loss: 5.339,
Time: 00d 00h 00m 13s, num timesteps 30, FPS 2,
Rewards:
Losses: Local Loss: 5.373,
Time: 00d 00h 00m 20s, num timesteps 40, FPS 1,
Rewards:
Losses: Local Loss: 5.161,
Time: 00d 00h 00m 27s, num timesteps 50, FPS 1,
Rewards:
Losses: Local Loss: 5.256,
Time: 00d 00h 00m 34s, num timesteps 60, FPS 1,
Rewards:
Losses: Local Loss: 5.418,
Time: 00d 00h 00m 41s, num timesteps 70, FPS 1,
Rewards:
Losses: Local Loss: 5.383,
Time: 00d 00h 00m 55s, num timesteps 80, FPS 1,
Rewards:
Losses: Local Loss: 5.250, SLAM Loss proj/exp/pose:0.2753/0.4042/1.0404
Time: 00d 00h 01m 10s, num timesteps 90, FPS 1,
Rewards:
Losses: Local Loss: 5.295, SLAM Loss proj/exp/pose:0.1872/0.2451/0.5508
Time: 00d 00h 01m 25s, num timesteps 100, FPS 1,
Rewards:
Losses: Local Loss: 5.321, SLAM Loss proj/exp/pose:0.1511/0.1848/0.3863
...
I0324 01:18:32.724668 23107 WindowlessContext.cpp:98] [EGL] Detected 6 EGL devices
I0324 01:18:32.724862 23107 WindowlessContext.cpp:119] [EGL] Selected EGL device 0 for CUDA device 1
I0324 01:18:32.727066 23107 WindowlessContext.cpp:133] [EGL] Version: 1.5
I0324 01:18:32.727088 23107 WindowlessContext.cpp:134] [EGL] Vendor: NVIDIA
I0324 01:18:37.851421 23107 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Cantwell.navmesh
2021-03-24 01:18:37,854 initializing task Nav-v0
2021-03-24 01:18:37,876 Computing map for data/scene_datasets/gibson/Cantwell.glb
2021-03-24 01:41:46,300 Computing map for data/scene_datasets/gibson/Cantwell.glb
Traceback (most recent call last):
File "main.py", line 769, in <module>
main()
File "main.py", line 617, in main
g_agent.update(g_rollouts)
File "/home/xxx/Neural-SLAM/algo/ppo.py", line 58, in update
for sample in data_generator:
File "/home/xxx/Neural-SLAM/utils/storage.py", line 95, in feed_forward_generator
mini_batch_size = batch_size // num_mini_batch
ZeroDivisionError: integer division or modulo by zero
Exception ignored in: <function VectorEnv.__del__ at 0x7fc52a9e2ef0>
Traceback (most recent call last):
File "/home/xxx/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 487, in __del__
self.close()
File "/home/xxx/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 351, in close
write_fn((CLOSE_COMMAND, None))
File "/home/xxx/anaconda3/envs/comp765/lib/python3.7/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/xxx/anaconda3/envs/comp765/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/xxx/anaconda3/envs/comp765/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Could anyone please tell me how to fix this?
SJingwen commented
I ran into the same error. Have you solved it?
devendrachaplot commented
Please add --num_mini_batch 1 to the command; this should fix the error.
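For context: the Namespace dump in the log shows num_mini_batch=0, and with --auto_gpu_config 0 that default is never overridden, so the integer division in utils/storage.py's feed_forward_generator divides by zero. The sketch below is a hypothetical illustration of that division with a defensive guard, not the repository's actual code; the real function has a different signature and surrounding logic.

```python
def mini_batch_size(batch_size: int, num_mini_batch: int) -> int:
    """Illustrative stand-in for the division that crashes in
    feed_forward_generator (utils/storage.py, line 95)."""
    # With --auto_gpu_config 0 the argparse default num_mini_batch=0 is
    # kept, so `batch_size // 0` raises ZeroDivisionError. Validating the
    # argument up front (or passing --num_mini_batch 1) avoids the crash.
    if num_mini_batch < 1:
        raise ValueError(
            "num_mini_batch must be >= 1; pass --num_mini_batch 1 "
            "when --auto_gpu_config is disabled")
    return batch_size // num_mini_batch
```

With --num_mini_batch 1 the whole rollout becomes a single mini-batch, e.g. mini_batch_size(40, 1) returns 40.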
DEQDON commented
Thanks, this works!