Intermittent errors (issue migrated from ETS Gitlab)
dan-blanchard opened this issue · 5 comments
Occasionally, we get errors like this:
```
Error while unpickling output for pythongrid job 0 from stored with key output_71c951d7-6ba0-41a0-8267-f31ef7130c64_0
This could caused by a problem with the cluster environment, imports or environment variables.
Try running `pythongrid.py 71c951d7-6ba0-41a0-8267-f31ef7130c64 0 /home/nlp-text/dynamic/mheilman/sklearn_wrapper /scratch/ loki.research.ets.org` to see if your job crashed before writing its output.
Check log files for more information:
stdout: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.o1718937
stderr: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.e1718937
Exception: must be string or buffer, not None
```
The strange thing is that if you run the command, you can see that the job did not crash before writing its output, and if you connect to the underlying Redis server, you can see that the data is there. It's just that sometimes, for whatever reason, reading the data from the server fails (even though we have multiple retries built in to deal with synchronization problems).
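For context, a failed read like this would hand `None` to the unpickler, which is presumably where the `must be string or buffer, not None` exception above comes from. Here is a minimal sketch of the kind of retry loop described, assuming a redis-py client; the function name and retry parameters are made up for illustration and this is not gridmap's actual code:

```python
import pickle
import time

import redis


def fetch_output(host, key, attempts=5, delay=2.0):
    # Hypothetical sketch, not the library's real implementation: try to
    # read a pickled job result from Redis a few times before giving up,
    # in case the worker has not finished writing it yet.
    client = redis.StrictRedis(host=host)
    for _ in range(attempts):
        raw = client.get(key)
        if raw is not None:
            return pickle.loads(raw)
        time.sleep(delay)
    raise RuntimeError('No output stored under key {0}'.format(key))
```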
I should note that this happens pretty infrequently, but when it does, it's really annoying.
I just added some more detailed error printing for when this happens, and it seems like it may actually have nothing to do with Redis and everything to do with SGE flaking out every once in a while.
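The extra diagnostics are roughly of this shape (a hedged sketch with made-up names, not necessarily the actual change): report which key failed and the full traceback instead of just the bare exception message.

```python
import pickle
import traceback


def load_result(raw, key):
    # Illustrative only: surface the failing key and the underlying
    # traceback so intermittent failures are easier to diagnose.
    try:
        return pickle.loads(raw)
    except Exception:
        print('Could not unpickle output stored under {0}'.format(key))
        print(traceback.format_exc())
        raise
```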
👍
The issue seems to be that our installation of SGE likes to tell DRMAA that jobs have finished when they haven't, so pythongrid tries to retrieve results that aren't there yet. This should be fixed when PR #11 is merged.
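One way to work around that behaviour is to keep polling for the output for a while after DRMAA reports completion, rather than trusting that the result is immediately readable. A rough sketch, assuming the drmaa Python bindings and a hypothetical `fetch_output` helper; this is not necessarily what PR #11 actually does:

```python
import time

import drmaa


def wait_then_poll(session, job_id, fetch_output, timeout=60.0, delay=2.0):
    # Hypothetical workaround sketch: SGE may tell DRMAA the job is done
    # before its output is readable, so keep retrying the fetch for a while.
    session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = fetch_output(job_id)
        if result is not None:
            return result
        time.sleep(delay)
    raise RuntimeError('Job {0} reported done but produced no output'.format(job_id))
```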
This was fixed by #12.