Can not save the result when it is done

Question

duohongrui opened this issue 2 years ago · 7 comments

Hi, scGAN is a creative framework for simulating single-cell RNA-seq datasets. I recently tried scGAN on a server with a GeForce 2080Ti. No error occurred until the last step, saving the result to the given path. I ran the test in a Docker container.
Here is the command I used to simulate cells and save the result:

python main.py --param parameters.json --generate --cells_no 1000 500 0 200 --model_path /scGAN/use_scGAN --save_path /scGAN/test.h5ad

Here is the parameters.json:

{
    "exp_param": {
        "experiments_dir": "/scGAN",
        "GPU": [
            1
        ]
    },
    "experiments": {
        "use_scGAN": {
            "input_ds": {
                "clustering": {
                    "res": 0.15
                },
                "filtering": {
                    "min_cells": 0,
                    "min_genes": 0
                },
                "raw_input": "/scGAN/input_data.h5ad",
                "scale": "normalize_per_cell_LS_20000",
                "split": {
                    "balanced_split": true,
                    "split_seed": "default",
                    "test_cells": 0,
                    "valid_cells": 2000
                }
            },
            "training": {
                "max_steps": 10000,
                "learning_rate": {
                    "decay": true,
                    "alpha_0": 0.0001,
                    "alpha_final": 1e-05
                },
                "optimizer": {
                    "algorithm": "AMSGrad",
                    "beta1": 0.5,
                    "beta2": 0.9
                },
                "batch_size": 128,
                "critic_iters": 5,
                "checkpoint": null,
                "progress_freq": 10,
                "validation_freq": 1000,
                "save_freq": 1000,
                "summary_freq": 50
            },
            "model": {
                "type": "cscGAN",
                "latent_dim": 128,
                "output_LSN": 20000,
                "critic_layers": [
                    1024,
                    512,
                    256
                ],
                "gen_layers": [
                    256,
                    512,
                    1024
                ],
                "critic_cond_type": "proj",
                "gen_cond_type": "batchnorm",
                "lambd": 10
            }
        }
    }
}

While tracing this error, I found some clues:

Line 109 in main.py: the run_exp call lacks the save_cells_path parameter, which is defined in run_exp.py, so save_cells_path is always None whatever the user passes. I added the parameter to the run_exp call, but the error persisted.
Line 588 in cscGAN.py: the error says that tf.train.latest_checkpoint failed. But how can the checkpoint affect the save path? And why is the save path None?
Could you please help me solve the problem? Thanks very much!

Answer 1 · 2022-08-01T14:40:37.000Z

Hello @duohongrui,

> Line 109 in main.py: the run_exp call lacks the save_cells_path parameter, which is defined in run_exp.py, so save_cells_path is always None whatever the user passes. I added the parameter to the run_exp call, but the error persisted.

Good catch, I'll fix that.

> Line 588 in cscGAN.py: the error says that tf.train.latest_checkpoint failed. But how can the checkpoint affect the save path? And why is the save path None?

This should not be influenced by save_cells_path; save_path is different from save_cells_path.
save_path is an argument to saver.restore, and its value is generated by tf.train.latest_checkpoint. I guess the error occurs because the training failed or finished too early.
Could you check your experiment directory and see whether there are finished checkpoints?
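
For reference, tf.train.latest_checkpoint returns None when it finds no checkpoint state in the directory it is given, which would explain a save_path of None. Below is a minimal sketch for checking a job directory without loading TensorFlow. It assumes the usual TensorFlow 1.x layout (a `checkpoint` state file plus `<prefix>.index` files); the function name and the example path are illustrative and not part of scGAN.

```python
import os


def find_checkpoints(job_dir):
    """Return sorted checkpoint prefixes found in job_dir (empty list if none)."""
    if not os.path.isdir(job_dir):
        return []
    names = os.listdir(job_dir)
    # Without the "checkpoint" state file, tf.train.latest_checkpoint finds nothing.
    if "checkpoint" not in names:
        return []
    # Each saved step leaves a "<prefix>.index" file next to its data shards.
    return sorted(n[:-len(".index")] for n in names if n.endswith(".index"))


if __name__ == "__main__":
    # Assumed location, based on the paths mentioned in this thread.
    found = find_checkpoints("/scGAN/use_scGAN/job")
    print(found if found else "no finished checkpoints found")
```

An empty result suggests that training stopped before the first save, so there is nothing for saver.restore to load.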

Answer 2 · 2022-08-01T15:26:16.000Z

Hi, @fhausmann,
There are no other new files in the experiment directory. Maybe the training step failed without showing me any message. Could you tell me which files appear if the training finishes successfully?

Answer 3 · 2022-08-01T15:53:56.000Z

In your experiment folder there should be a folder called job with several files generated by TensorFlow.
Your parameters file indicates that you're writing everything to /scGAN. Did you map this folder from Docker to a directory on your system? Otherwise it may be written inside the Docker container only, which means it is gone once you run a new command in a new container.
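
One way to check this from inside the container is to look for the directory in the mount table. Below is a minimal sketch, assuming a Linux container where /proc/self/mounts is readable; the helper name is made up for illustration. A directory mapped from the host should show up there as a mount target, while one that lives only in the container's own filesystem should not.

```python
import os


def is_mount_point(path, mounts_text):
    """Return True if path appears as a mount target in /proc/self/mounts text."""
    target = os.path.normpath(path)
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # The second field is the mount target; spaces in it are escaped as \040.
        mount_target = fields[1].replace("\\040", " ")
        if os.path.normpath(mount_target) == target:
            return True
    return False


if __name__ == "__main__":
    # /proc/self/mounts exists on Linux only; /scGAN is the path used in this thread.
    if os.path.exists("/proc/self/mounts"):
        with open("/proc/self/mounts") as handle:
            print(is_mount_point("/scGAN", handle.read()))
```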

Answer 4 · 2022-08-02T03:47:01.000Z

Yes, I set the mount point to /scGAN in the Docker container and set the experiments_dir parameter to /scGAN in order to get the result directly on my local device. However, after the training step, the job directory is not in /scGAN but in /scGAN/use_scGAN.
Here is the tree structure of files after training:

I guess there may be something wrong with the experiment directory, so I will set a different experiment directory path and try again. I will let you know the result afterwards.
Thanks very much!

Answer 5 · 2022-08-02T06:09:42.000Z

The directory structure looks fine. The directory path is constructed from "experiments_dir" (/scGAN) and the name of the experiment (use_scGAN). However, there are too few files in the job directory, so I guess something went wrong during training.
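
As an illustration of that path construction, here is a minimal sketch that derives the expected job directory for each experiment in a parameters file. It only mirrors the description in this thread (experiments_dir, then the experiment name, then a job folder) and is not scGAN's actual code; the function name is hypothetical.

```python
import json
import os


def expected_job_dirs(params):
    """Map each experiment name to the job directory it is expected to use."""
    base = params["exp_param"]["experiments_dir"]
    return {
        name: os.path.join(base, name, "job")
        for name in params["experiments"]
    }


if __name__ == "__main__":
    # "parameters.json" is the file name used in the command from this thread.
    if os.path.exists("parameters.json"):
        with open("parameters.json") as handle:
            for name, job_dir in expected_job_dirs(json.load(handle)).items():
                state = "exists" if os.path.isdir(job_dir) else "missing"
                print(name, job_dir, state)
```

With the parameters posted above, this yields /scGAN/use_scGAN/job for the use_scGAN experiment, which matches where the job directory appeared.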

Answer 6 · 2022-08-02T08:32:43.000Z

OK, most likely the training step failed. I will try again. Thanks very much!

Answer 7 · 2022-08-15T07:07:35.000Z

Hi, @fhausmann. I have tried again and this time the simulated dataset was generated successfully.
Thanks for your patient responses.