Broken pipe caused by ZeroDivisionError
DEQDON opened this issue · 3 comments
DEQDON commented
Hi, I ran into a ZeroDivisionError which then causes a BrokenPipeError. The command I ran was python main.py -n1 --auto_gpu_config 0 --split val
At first, everything was fine and I could see the training progress with losses printed. But after some time it crashed; the error log is below (the timestamps are inconsistent because I combined two runs into one full log history):
Dumping at ./tmp//models/exp1/
Namespace(alpha=0.99, auto_gpu_config=0, camera_height=1.25, clip_param=0.2, collision_threshold=0.2, cuda=True, du_scale=2, dump_location='./tmp/', entropy_coef=0.001, env_frame_height=256, env_frame_width=256, eps=1e-05, eval=0, exp_loss_coeff=1.0, exp_name='exp1', frame_height=128, frame_width=128, gamma=0.99, global_downscaling=2, global_hidden_size=256, global_lr=2.5e-05, goals_size=2, hfov=90.0, load_global='0', load_local='0', load_slam='0', local_hidden_size=512, local_optimizer='adam,lr=0.0001', local_policy_update_freq=5, log_interval=10, map_pred_threshold=0.5, map_resolution=5, map_size_cm=2400, max_episode_length=1000, max_grad_norm=0.5, no_cuda=False, noise_level=1.0, noisy_actions=1, noisy_odometry=1, num_episodes=1000000, num_global_steps=40, num_local_steps=25, num_mini_batch=0, num_processes=1, num_processes_on_first_gpu=0, num_processes_per_gpu=11, obs_threshold=1, obstacle_boundary=5, pose_loss_coeff=10000.0, ppo_epoch=4, pretrained_resnet=1, print_images=0, proj_loss_coeff=1.0, randomize_env_every=1000, save_interval=1, save_periodic=500000, save_trajectory_data='0', seed=1, short_goal_dist=1, sim_gpu_id=0, slam_batch_size=72, slam_iterations=10, slam_memory_size=500000, slam_optimizer='adam,lr=0.0001', split='val', task_config='tasks/pointnav_gibson.yaml', tau=0.95, total_num_scenes='auto', train_global=1, train_local=1, train_slam=1, use_deterministic_local=0, use_gae=False, use_pose_estimation=2, use_recurrent_global=0, use_recurrent_local=1, value_loss_coef=0.5, vis_type=1, vision_range=64, visualize=0)
Loading data/scene_datasets/gibson/Cantwell.glb
2021-03-24 01:48:47,936 initializing sim Sim-v0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0324 01:48:47.955277 25783 WindowlessContext.cpp:98] [EGL] Detected 6 EGL devices
I0324 01:48:47.955494 25783 WindowlessContext.cpp:119] [EGL] Selected EGL device 1 for CUDA device 0
I0324 01:48:47.958079 25783 WindowlessContext.cpp:133] [EGL] Version: 1.5
I0324 01:48:47.958138 25783 WindowlessContext.cpp:134] [EGL] Vendor: NVIDIA
Renderer: TITAN V/PCIe/SSE2 by NVIDIA Corporation
OpenGL version: 4.6.0 NVIDIA 455.45.01
Using optional features:
GL_ARB_ES2_compatibility
GL_ARB_direct_state_access
GL_ARB_get_texture_sub_image
GL_ARB_invalidate_subdata
GL_ARB_multi_bind
GL_ARB_robustness
GL_ARB_separate_shader_objects
GL_ARB_texture_filter_anisotropic
GL_ARB_texture_storage
GL_ARB_texture_storage_multisample
GL_ARB_vertex_array_object
GL_KHR_debug
Using driver workarounds:
nv-egl-incorrect-gl11-function-pointers
no-layout-qualifiers-on-old-glsl
nv-zero-context-profile-mask
nv-implementation-color-read-format-dsa-broken
nv-cubemap-inconsistent-compressed-image-size
nv-cubemap-broken-full-compressed-image-query
nv-compressed-block-size-in-bits
I0324 01:48:53.312220 25783 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Cantwell.navmesh
2021-03-24 01:48:53,313 initializing task Nav-v0
2021-03-24 01:48:53,324 Computing map for data/scene_datasets/gibson/Cantwell.glb
Time: 00d 00h 00m 00s, num timesteps 0, FPS 0,
Rewards:
Losses:
Time: 00d 00h 00m 01s, num timesteps 10, FPS 8,
Rewards:
Losses: Local Loss: 5.300,
Time: 00d 00h 00m 05s, num timesteps 20, FPS 3,
Rewards:
Losses: Local Loss: 5.339,
Time: 00d 00h 00m 13s, num timesteps 30, FPS 2,
Rewards:
Losses: Local Loss: 5.373,
Time: 00d 00h 00m 20s, num timesteps 40, FPS 1,
Rewards:
Losses: Local Loss: 5.161,
Time: 00d 00h 00m 27s, num timesteps 50, FPS 1,
Rewards:
Losses: Local Loss: 5.256,
Time: 00d 00h 00m 34s, num timesteps 60, FPS 1,
Rewards:
Losses: Local Loss: 5.418,
Time: 00d 00h 00m 41s, num timesteps 70, FPS 1,
Rewards:
Losses: Local Loss: 5.383,
Time: 00d 00h 00m 55s, num timesteps 80, FPS 1,
Rewards:
Losses: Local Loss: 5.250, SLAM Loss proj/exp/pose:0.2753/0.4042/1.0404
Time: 00d 00h 01m 10s, num timesteps 90, FPS 1,
Rewards:
Losses: Local Loss: 5.295, SLAM Loss proj/exp/pose:0.1872/0.2451/0.5508
Time: 00d 00h 01m 25s, num timesteps 100, FPS 1,
Rewards:
Losses: Local Loss: 5.321, SLAM Loss proj/exp/pose:0.1511/0.1848/0.3863
...
I0324 01:18:32.724668 23107 WindowlessContext.cpp:98] [EGL] Detected 6 EGL devices
I0324 01:18:32.724862 23107 WindowlessContext.cpp:119] [EGL] Selected EGL device 0 for CUDA device 1
I0324 01:18:32.727066 23107 WindowlessContext.cpp:133] [EGL] Version: 1.5
I0324 01:18:32.727088 23107 WindowlessContext.cpp:134] [EGL] Vendor: NVIDIA
I0324 01:18:37.851421 23107 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Cantwell.navmesh
2021-03-24 01:18:37,854 initializing task Nav-v0
2021-03-24 01:18:37,876 Computing map for data/scene_datasets/gibson/Cantwell.glb
2021-03-24 01:41:46,300 Computing map for data/scene_datasets/gibson/Cantwell.glb
Traceback (most recent call last):
File "main.py", line 769, in <module>
main()
File "main.py", line 617, in main
g_agent.update(g_rollouts)
File "/home/xxx/Neural-SLAM/algo/ppo.py", line 58, in update
for sample in data_generator:
File "/home/xxx/Neural-SLAM/utils/storage.py", line 95, in feed_forward_generator
mini_batch_size = batch_size // num_mini_batch
ZeroDivisionError: integer division or modulo by zero
Exception ignored in: <function VectorEnv.__del__ at 0x7fc52a9e2ef0>
Traceback (most recent call last):
File "/home/xxx/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 487, in __del__
self.close()
File "/home/xxx/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 351, in close
write_fn((CLOSE_COMMAND, None))
File "/home/xxx/anaconda3/envs/comp765/lib/python3.7/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/xxx/anaconda3/envs/comp765/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/xxx/anaconda3/envs/comp765/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Could anyone please tell me how to fix this?
SJingwen commented
I ran into the same error. Have you solved it?
devendrachaplot commented
Please add --num_mini_batch 1 to the command; this should fix the error.
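For context: the Namespace dump in the log shows num_mini_batch=0, and with --auto_gpu_config 0 that default is never overridden, so the integer division in utils/storage.py's feed_forward_generator divides by zero. The sketch below is a hypothetical illustration of that division with a defensive guard, not the repository's actual code; the real function has a different signature and surrounding logic.

```python
def mini_batch_size(batch_size: int, num_mini_batch: int) -> int:
    """Illustrative stand-in for the division that crashes in
    feed_forward_generator (utils/storage.py, line 95)."""
    # With --auto_gpu_config 0 the argparse default num_mini_batch=0 is
    # kept, so `batch_size // 0` raises ZeroDivisionError. Validating the
    # argument up front (or passing --num_mini_batch 1) avoids the crash.
    if num_mini_batch < 1:
        raise ValueError(
            "num_mini_batch must be >= 1; pass --num_mini_batch 1 "
            "when --auto_gpu_config is disabled")
    return batch_size // num_mini_batch
```

With --num_mini_batch 1 the whole rollout becomes a single mini-batch, e.g. mini_batch_size(40, 1) returns 40.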
DEQDON commented
Thanks, this works!