Not able to submit multiple jobs running scglue

Question

Closed this issue 6 months ago · 4 comments

Hi,

I am trying to submit multiple jobs that each run an scglue model. The first job finishes on time, but the rest of the jobs get stuck. Is this because they are all using the same pretrain directory? The message about that directory is the last output I get from the runs that are stuck.

Thanks for your feedback.

Sorry for the incomplete message. I have now added the output that the run writes to the ".out" file:

Performing analysis using predicated lables, v1 approach :
[INFO] fit_SCGLUE: Pretraining SCGLUE model...
[INFO] autodevice: Using CPU as computation device.
[INFO] check_graph: Checking variable coverage...
[INFO] check_graph: Checking edge attributes...
[INFO] check_graph: Checking self-loops...
[INFO] check_graph: Checking graph symmetry...
[INFO] SCGLUEModel: Setting `graph_batch_size` = 39871
[INFO] SCGLUEModel: Setting `max_epochs` = 100
[INFO] SCGLUEModel: Setting `patience` = 9
[INFO] SCGLUEModel: Setting `reduce_lr_patience` = 5
[INFO] SCGLUETrainer: Using training directory: "glue/pretrain"

This is the error message in the ".err" file. However, it could be the result of purposely killing the job:

/software/miniconda/4.9.2/lib/python3.9/abc.py:98: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "
/software/miniconda/4.9.2/lib/python3.9/abc.py:98: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/software/miniconda/4.9.2/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/tensorboardX/event_file_writer.py", line 219, in run
    self._record_writer.flush()
  File "/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/tensorboardX/event_file_writer.py", line 69, in flush
    self._py_recordio_writer.flush()
  File "/home/p541i/DATA/packages/scglue/lib/python3.9/site-packages/tensorboardX/record_writer.py", line 193, in flush
    self._writer.flush()
OSError: [Errno 116] Stale file handle
Terminated

Answer 1 · 2024-05-10T03:27:16.000Z

Hi @piyushjo15. Thank you for your interest in GLUE! Did you forget to append the messages you get?

Answer 2 · 2024-05-31T20:13:30.000Z

Hi @Jeff1995, I just want to follow up on this.

Answer 3 · 2024-06-13T03:30:42.000Z

Sorry for the late reply. The output does not show anything strange other than the tensorboard log writer getting a stale file handle. Did you try specifying different log directories when training the models? This can be done via the fit_kws argument in fit_SCGLUE:

scglue.models.fit_SCGLUE(..., fit_kws={"directory": "/path/to/directory"})
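If the jobs are launched from one script, a small helper can give each of them its own directory. The following is only a sketch: `unique_training_dir` and the `run_<tag>` naming are hypothetical, and reading `SLURM_JOB_ID` assumes a SLURM scheduler (other schedulers expose a different variable). It falls back to a random tag when no job ID is available.

```python
import os
import uuid
from pathlib import Path


def unique_training_dir(base="glue", job_id=None):
    """Return a per-job directory path such as <base>/run_<tag> (hypothetical helper)."""
    # Assumption: the scheduler is SLURM and exports SLURM_JOB_ID.
    # If no job ID is given or found, use a short random tag instead.
    tag = job_id or os.environ.get("SLURM_JOB_ID") or uuid.uuid4().hex[:8]
    path = Path(base) / f"run_{tag}"
    # Creating the directory up front is harmless if it already exists.
    path.mkdir(parents=True, exist_ok=True)
    return str(path)


# Usage, with the other fit_SCGLUE arguments elided as in the line above:
# glue = scglue.models.fit_SCGLUE(..., fit_kws={"directory": unique_training_dir()})
```

With this, concurrent jobs no longer write their tensorboard logs and checkpoints into the same location.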

Answer 4 · 2024-06-18T11:57:39.000Z

Thank you so much for your suggestion. Indeed, this small, silly problem was the root of the issue. It works fine now.