An issue, possibly relating to the use of a custom whitelist of machines
dmnapolitano opened this issue · 2 comments
Howdy! While evaluating all the texts yesterday afternoon/evening, I received the following error:
Traceback (most recent call last):
  File "SR5_Batch.py", line 77, in <module>
    results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 773, in process_jobs
    monitor.check(sid, jobs)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 372, in check
    self.check_if_alive()
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 429, in check_if_alive
    handle_resubmit(self.session_id, job, temp_dir=self.temp_dir)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 584, in handle_resubmit
    job.white_list.remove(node_name)
ValueError: list.remove(x): x not in list
Nothing actually died or anything, so I just kinda left it to see what would happen, and discovered hours later that nothing had actually happened. My gridmapping just kinda hung there.
The interesting thing is that this doesn't seem to happen consistently. I was able to evaluate ~60 texts just fine, but when I went for 1180, in batches of 100, I received this at around the 9th batch. When I can, I'll test this again to see how reproducible it is. 😕
When I can do things with stuff, I'll try putting the full name of the machines in my whitelist, and see what happens.
Yeah, you always need to specify the fully qualified domain name of any host in the whitelist. I'll change it so it doesn't outright crash in the future though.
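In the meantime, here is a rough sketch of a workaround on the caller's side, plus the kind of guard that would avoid the crash. It is not gridmap's actual fix: `to_fqdn_list` and `remove_if_present` are hypothetical helper names, and the sketch assumes the standard library's `socket.getfqdn` returns the same fully qualified names that the grid reports for its nodes, which may not hold on every cluster.

```python
import socket


def to_fqdn_list(hosts, resolve=socket.getfqdn):
    """Expand each host name to its fully qualified form.

    `resolve` defaults to socket.getfqdn; it is a parameter only so the
    lookup can be swapped out (for example, in tests).
    """
    return [resolve(host) for host in hosts]


def remove_if_present(white_list, node_name):
    """Remove node_name from white_list in place.

    Returns True if the name was removed and False if it was not in the
    list, instead of letting list.remove() raise ValueError.
    """
    try:
        white_list.remove(node_name)
    except ValueError:
        return False
    return True
```

With something like this, the whitelist could be built as `WHITE_LIST = to_fqdn_list(["node1", "node2"])` (placeholder host names) before being passed to `gridmap.process_jobs(..., white_list=WHITE_LIST)`.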