Not able to submit multiple jobs running scglue
Hi,
I am trying to submit multiple jobs that each run an scglue model. The first job finishes on time, but the rest get stuck. Is this because they are all using the same pretrain directory? That is the last message I get for the runs that are stuck.
Thanks for your feedback.
Sorry for the incomplete message. I have now added the output that I get in the ".out" file:
Performing analysis using predicated lables, v1 approach :
[INFO] fit_SCGLUE: Pretraining SCGLUE model...
[INFO] autodevice: Using CPU as computation device.
[INFO] check_graph: Checking variable coverage...
[INFO] check_graph: Checking edge attributes...
[INFO] check_graph: Checking self-loops...
[INFO] check_graph: Checking graph symmetry...
[INFO] SCGLUEModel: Setting `graph_batch_size` = 39871
[INFO] SCGLUEModel: Setting `max_epochs` = 100
[INFO] SCGLUEModel: Setting `patience` = 9
[INFO] SCGLUEModel: Setting `reduce_lr_patience` = 5
[INFO] SCGLUETrainer: Using training directory: "glue/pretrain"
The error message in the ".err" file is below. However, this could be the result of my purposely killing the job:
/software/miniconda/4.9.2/lib/python3.9/abc.py:98: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "
/software/miniconda/4.9.2/lib/python3.9/abc.py:98: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/software/miniconda/4.9.2/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/tensorboardX/event_file_writer.py", line 219, in run
self._record_writer.flush()
File "/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/tensorboardX/event_file_writer.py", line 69, in flush
self._py_recordio_writer.flush()
File "/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/tensorboardX/record_writer.py", line 193, in flush
self._writer.flush()
OSError: [Errno 116] Stale file handle
Terminated
Hi @piyushjo15. Thank you for your interest in GLUE! Did you forget to append the messages you get?
Hi @Jeff1995, I just want to follow up on this.
Sorry for the late reply. The output does not show anything strange other than the tensorboard log writer getting a stale file handle. Did you try specifying different log directories when training the models? This can be done via the fit_kws argument in fit_SCGLUE:
scglue.models.fit_SCGLUE(..., fit_kws={"directory": "/path/to/directory"})
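When several jobs are submitted at once, one way to keep them from sharing a training directory is to derive the path from a per-job identifier. The following is only a minimal sketch: the helper name make_fit_kws, the "job_<id>" naming scheme, and the use of the SLURM_JOB_ID environment variable are assumptions for illustration and are not part of scglue. Adjust the identifier to whatever your scheduler provides.

```python
import os
import tempfile
from pathlib import Path


def make_fit_kws(base_dir, job_id=None):
    """Build a fit_kws dict whose "directory" is unique to this job.

    Hypothetical helper, not part of scglue.
    """
    if job_id is None:
        # Assumption: the scheduler exports SLURM_JOB_ID.
        # Fall back to the process ID when it is not set.
        job_id = os.environ.get("SLURM_JOB_ID", str(os.getpid()))
    directory = Path(base_dir) / f"job_{job_id}"
    # Create the directory up front so a bad path fails early.
    directory.mkdir(parents=True, exist_ok=True)
    return {"directory": str(directory)}


if __name__ == "__main__":
    # Demonstration with a temporary base directory.
    base = tempfile.mkdtemp()
    fit_kws = make_fit_kws(base)
    print(fit_kws["directory"])
    # The dict would then be passed on as in the call above:
    # scglue.models.fit_SCGLUE(..., fit_kws=fit_kws)
```

With this, each job writes its checkpoints and tensorboard logs under its own subdirectory, so concurrent runs do not touch the same files.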
Thank you so much for your suggestion. Indeed this small silly problem was the root of the issue. It works fine now.