usnistgov/ARIAC

automated eval error

dan9thsense opened this issue · 6 comments

Downloaded the latest docker image and pulled the latest updates from the automated evaluation repo.
When running the trials, they intermittently generate the error below. It almost always happens at some point with very long runs, multiple orders, etc., and usually does not occur with short runs such as the example kitting trial. I tried rebuilding the container, but it did not help. The same trial runs fine outside of docker.

Things seemed to be going along ok, then suddenly:

terminate called after throwing an instance of 'boost::wrapexcept<boost::thread_resource_error>'
  what():  boost thread: trying joining itself: Resource deadlock avoided
Aborted (core dumped)
Gazebo not running
==== Trial assembly_all_stations completed
return_code: -9
==== Copying logs to
Successfully copied 2.05kB to /home/dbarry/ariac_ws/src/ARIAC_evaluation/automated_evaluation/logs/sirius/assembly_all_stations_5/trial_log.txt
Successfully copied 2.05kB to /home/dbarry/ariac_ws/src/ARIAC_evaluation/automated_evaluation/logs/sirius/assembly_all_stations_5/sensor_cost.txt
Error response from daemon: Could not find the file /tmp/state.log in container sirius
Successfully copied 174kB to /home/dbarry/ariac_ws/src/ARIAC_evaluation/automated_evaluation/logs/sirius/assembly_all_stations_5/ros_log/

Running today, it happened right at the beginning of a simple assembly trial. It does not happen with simple kitting trials.

I do see some environment errors at startup (many of this type), but I don't know if they are related or just a red herring:

[gzserver-1] Error [parser_urdf.cc:3183] Unable to call parseURDF on robot model
[gzserver-1] Error [parser.cc:488] parse as old deprecated model file failed.
[gzclient-2] [INFO] [1713741776.049472237] [gazebo_ros_node]: ROS was initialized without arguments.

I am not sure about the first error. It seems like gazebo crashed while running the trial and the automated evaluation handled the error. We are working on an update to the automated evaluation that reruns any trials where gazebo crashes; this is what we will be using for running the smoke test qualifiers. Since it is only happening on long runs, it might be an issue with the container running out of memory.
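Roughly, the rerun wrapper only needs to look at the trial's return code and launch the trial again; something along these lines (a sketch only, with a placeholder run_trial.sh command, not the actual evaluation script):

import subprocess

MAX_ATTEMPTS = 3  # rerun a crashed trial up to this many times

def run_trial_with_retries(trial_name: str) -> int:
    """Run one trial, rerunning it if the simulation crashes (hypothetical helper)."""
    returncode = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # "./run_trial.sh" stands in for whatever command starts a single trial.
        result = subprocess.run(["./run_trial.sh", trial_name])
        returncode = result.returncode
        # A negative return code means the trial process was killed by a signal
        # (e.g. the return_code: -9 seen above after gazebo aborts), so treat
        # that as a crash and rerun.
        if returncode >= 0:
            break
        print(f"Trial {trial_name} crashed (code {returncode}), attempt {attempt}/{MAX_ATTEMPTS}")
    return returncode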

The second error is known and is due to the record_state argument being set to true in ariac.launch.py. This allows gazebo to record the state.log file, which is used for trial playback after runtime. The errors do not seem to cause any issues, however, so they can be ignored.
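For reference, the argument itself is an ordinary launch argument; a stripped-down sketch of how a record_state flag is typically declared in a ROS 2 Python launch file is below. The part that actually forwards it to gazebo's state recording is ARIAC-specific and omitted, so treat the details as assumptions rather than the real ariac.launch.py.

from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument, LogInfo
from launch.substitutions import LaunchConfiguration

def generate_launch_description():
    # Hypothetical sketch of a record_state argument like the one in ariac.launch.py.
    record_state_arg = DeclareLaunchArgument(
        "record_state",
        default_value="true",
        description="Record the gazebo state log used for trial playback",
    )
    return LaunchDescription([
        record_state_arg,
        # ARIAC forwards this value to gazebo's state logging; that wiring is omitted here.
        LogInfo(msg=["record_state: ", LaunchConfiguration("record_state")]),
    ])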

I thought of an OOM error in Docker, but I've seen those before and you get the error code in the docker logs when that happens. Nevertheless, that is a possibility, and it's easy enough to test when you set up docker for the runs. Unfortunately, I don't think it is that simple, as the runs were going OK before the latest image (using the older image and updating by pulling from the repo). Also, I had a recent run that crashed right at the start of the assembly task rather than late in a long run.
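One quick way to check the OOM theory after a crash is to ask docker directly whether the container was OOM-killed, for example (using the sirius container name from the logs above):

import subprocess

def was_oom_killed(container: str = "sirius") -> bool:
    """Return True if docker reports the container was killed by the OOM killer."""
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.State.OOMKilled}}", container],
        text=True,
    )
    return out.strip() == "true"

if __name__ == "__main__":
    print("OOM killed:", was_oom_killed())

If that prints False after one of these crashes, memory pressure inside the container looks less likely as the cause.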

It will be interesting to see what the smoke tests show.

We did not run into any major issues running the smoke test; our scripts that rerun trials when there is an error seem to have worked for all the smoke test trials. Results are available in each team's submission folder.

Closing this issue for now. Feel free to reopen if you think it is a problem in the future.

Just FYI, this problem continues. I did a clean install of Ubuntu 22.04 to see if there was something odd going on with my docker setup, but even with just docker, ROS2, and ARIAC installed, the crashes are frequent enough to prevent doing run-all with the automated evaluation scripts. Here is another example:

[gzserver-1] [libprotobuf FATAL /usr/include/google/protobuf/repeated_field.h:1694] CHECK failed: (index) < (current_size_): 
[gzserver-1] terminate called after throwing an instance of 'google::protobuf::FatalException'
[gzserver-1]   what():  CHECK failed: (index) < (current_size_): 
Traceback (most recent call last):
  File "/container_scripts/run_trial.py", line 110, in <module>
    main()
  File "/container_scripts/run_trial.py", line 84, in main
    output = subprocess.check_output(
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.10/subprocess.py", line 2022, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
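The traceback is cut off, but it ends inside subprocess's timeout check, which suggests the check_output call in run_trial.py is either hitting its timeout or seeing the trial command fail after gzserver aborts. A caller doing run-all could guard against both cases roughly like this (just a sketch with assumed names, not the actual run_trial.py code):

import subprocess

def run_one_trial(cmd: list, timeout_s: float) -> bool:
    """Run a single trial command; return True on success (hypothetical wrapper)."""
    try:
        subprocess.check_output(cmd, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        print("Trial exceeded its time limit; flag it for a rerun.")
    except subprocess.CalledProcessError as err:
        # A gzserver abort shows up here as a non-zero return code.
        print(f"Trial command failed with return code {err.returncode}.")
    return False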

Encountered the error while running outside of docker, which eliminates docker and the docker image as the source of the bug. This occurred as AGV4 was moving toward the warehouse, with nothing else going on.

[gzserver-1] [libprotobuf FATAL /usr/include/google/protobuf/repeated_field.h:1694] CHECK failed: (index) < (current_size_):
[gzserver-1] terminate called after throwing an instance of 'google::protobuf::FatalException'
[gzserver-1] what(): CHECK failed: (index) < (current_size_):
[ERROR] [gzserver-1]: process has died [pid 33218, exit code -6, cmd 'gzserver /home/dbarry/ariac_ws/install/ariac_gazebo/share/ariac_gazebo/worlds/ariac.world -slibgazebo_ros_init.so -slibgazebo_ros_factory.so -slibgazebo_ros_force_system.so'].