cannot run PACO train on prod (vGPU server)
We can now access the GPU on the production server. Classification works, but PACO training always fails with the message below, even though the same workflow and input files complete successfully on staging:
Task Training model for Patchwise Analysis of Music Document, Training[eacd36d5-c8dd-4b02-b9cd-38ca31c92959] raised unexpected: RuntimeError("The job did not produce the output file for Background Model.\n\n{'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('c031c7c9-c86d-481f-ae62-59f0b2491828'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/0fd193e7-8b3d-45a0-84d8-b99c0b2b8fc0'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('c7a3b8ca-429a-4fcd-be92-2957e00497ba'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/faebcd1f-c384-44b9-9572-25f5ac27b12e'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('eea42920-5100-4a57-9604-7688d582b482'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/a69ba529-0053-4c7f-bc3d-333942300b15'}], 'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('3fd388db-1cf3-4400-9fb7-712bd3f6738e'), 'is_list': False, 'resource_temp_path': '/tmp/tmpa8eq61gp/03ac3bb9-3323-42a0-b989-5eadae3a0529'}]}")
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
return self.run(*args, **kwargs)
File "/code/Rodan/rodan/jobs/base.py", line 843, in run
).format(opt_name, outputs)
RuntimeError: The job did not produce the output file for Background Model.
I thought it was an out-of-memory issue, so I didn't think too much of it since we are still waiting for the larger vGPU instance. However, after testing and looking into this further, it seems to be a different problem. I closed the vGPU driver issue (#1170) and will work on this instead.
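For reference, a quick way to rule out an out-of-memory or GPU-visibility problem is to run a small check from inside the Celery worker container before dispatching the job. This is only a diagnostic sketch; it assumes the PACO training job runs on TensorFlow/Keras (the outputs are keras/model+hdf5) and that nvidia-smi is available in the container.

```python
# Diagnostic sketch (assumption: the PACO job uses TensorFlow/Keras, as the
# keras/model+hdf5 output types suggest). Run inside the worker container,
# e.g. `docker exec -it <celery-worker> python3 check_gpu.py`.
import subprocess

import tensorflow as tf

# 1. Does TensorFlow see the vGPU at all?
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# 2. Can we actually allocate and compute on it? An OOM or driver problem
#    usually surfaces here rather than in the job's own (empty) logs.
if gpus:
    with tf.device("/GPU:0"):
        x = tf.random.normal((1024, 1024))
        print("Test matmul OK:", tf.matmul(x, x).shape)

# 3. What does the driver report? (Requires nvidia-smi in the container.)
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```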
I'm baffled. The directory tmpa8eq61gp was created successfully with all necessary permissions, but no hdf5 files were written during the process. There were also no other logs or error messages to help identify the exact cause. Since the same workflow runs without any issue on staging, I don't think it's a bug in the PACO repo itself, and /rodan-main/code/rodan/jobs/base.py is only checking whether the output file exists. So it looks like it might be a bug in the Rodan PACO wrapper or something else.
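For context, the check in base.py that raises this error is essentially a post-run existence test on each declared output port. Roughly something like the following (a simplified sketch inferred from the traceback and error message, not the actual Rodan source):

```python
# Simplified sketch of the kind of check run() performs in
# /rodan-main/code/rodan/jobs/base.py (not the actual Rodan code):
# after the job body returns, every declared output port must have a
# non-empty file at its resource_temp_path, otherwise a RuntimeError is
# raised naming the port ("Background Model" in our case).
import os

def verify_outputs(outputs):
    for opt_name, resources in outputs.items():
        for resource in resources:
            path = resource["resource_temp_path"]
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                raise RuntimeError(
                    "The job did not produce the output file for {0}.\n\n{1}".format(
                        opt_name, outputs
                    )
                )
```

In other words, the wrapper does not inspect why training produced nothing; it only notices that the expected hdf5 file is missing, which matches what we see.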
This is also strange because, if training never started, no hdf5 files would have been written yet, of course. In this case, though, it is the missing output file that is reported as the failure, as if that were what prevented training.
The same error is reproduced on a local machine with an Intel chip (where we don't have the GPU container problem that affects the M-series ARM machines).
Since we are likely to eventually use the distributed version for Rodan prod (everything else has already been set up successfully, see #1184), I will do all the testing on the current single-instance version (rodan2.simssa.ca).
The issue has now been transferred to the PACO repo here.
Can do training now