fjxmlzn/DoppelGANger

generated_samples

fxctydfty opened this issue · 10 comments

I am able to run the training algorithm, but when I run the data generation (generating_data), it never creates any output in the "generated_samples" folder. I attached the worker log here. Could you please help me with that?
Thanks in advance.

worker_generate_data.log

The logs look normal. How long has this been stuck without the generated_samples folder appearing?

After it prints out "Finish Building", nothing happens. I tried several times; same thing.

I am running Python 3.7.10 and TensorFlow 1.14.0.

Could you please share the example_generating_data/config_generate_data.py and example_training/config.py you are using?

config_generate_data.py

config = {
    "scheduler_config": {
        "gpu": ["0"],
        "config_string_value_maxlen": 1000,
        "result_root_folder": "../results/",
        "scheduler_log_file_path": "scheduler_generate_data.log",
        "log_file": "worker_generate_data.log",
        "force_rerun": True
    },

    "global_config": {
        "batch_size": 100,
        "vis_freq": 200,
        "vis_num_sample": 5,
        "d_rounds": 1,
        "g_rounds": 1,
        "num_packing": 1,
        "noise": True,
        "feed_back": False,
        "g_lr": 0.001,
        "d_lr": 0.001,
        "d_gp_coe": 10.0,
        "gen_feature_num_layers": 1,
        "gen_feature_num_units": 100,
        "gen_attribute_num_layers": 3,
        "gen_attribute_num_units": 100,
        "disc_num_layers": 5,
        "disc_num_units": 200,
        "initial_state": "random",

        "attr_d_lr": 0.001,
        "attr_d_gp_coe": 10.0,
        "g_attr_d_coe": 1.0,
        "attr_disc_num_layers": 5,
        "attr_disc_num_units": 200,

        "generate_num_train_sample": 50000,
        "generate_num_test_sample": 50000
    },

    "test_config": [
        {
            "dataset": ["web"],
            "epoch": [2],
            "run": [0, 1, 2],
            "sample_len": [1, 5],
            "extra_checkpoint_freq": [5],
            "epoch_checkpoint_freq": [1],
            "aux_disc": [False],
            "self_norm": [False]
        }
    ]
}

config.py

config = {
    "scheduler_config": {
        "gpu": ["0", "1"],
        "config_string_value_maxlen": 1000,
        "result_root_folder": "../results/"
    },

    "global_config": {
        "batch_size": 100,
        "vis_freq": 200,
        "vis_num_sample": 5,
        "d_rounds": 1,
        "g_rounds": 1,
        "num_packing": 1,
        "noise": True,
        "feed_back": False,
        "g_lr": 0.001,
        "d_lr": 0.001,
        "d_gp_coe": 10.0,
        "gen_feature_num_layers": 1,
        "gen_feature_num_units": 100,
        "gen_attribute_num_layers": 3,
        "gen_attribute_num_units": 100,
        "disc_num_layers": 5,
        "disc_num_units": 200,
        "initial_state": "random",

        "attr_d_lr": 0.001,
        "attr_d_gp_coe": 10.0,
        "g_attr_d_coe": 1.0,
        "attr_disc_num_layers": 5,
        "attr_disc_num_units": 200,
    },

    "test_config": [
        {
            "dataset": ["web"],
            "epoch": [1],
            "run": [0, 1, 2],
            "sample_len": [1, 5],
            "extra_checkpoint_freq": [5],
            "epoch_checkpoint_freq": [1],
            "aux_disc": [False],
            "self_norm": [False]
        }
    ]
}

I see where the problem comes from. example_generating_data/gan_generate_data_task.py only generates data for the mid-checkpoints. In your config.py you train the model for only 1 epoch ("epoch": [1]), while the frequency for saving mid-checkpoints is 5 ("extra_checkpoint_freq": [5]), so the code never saved any mid-checkpoints at all and therefore never generated any samples.
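
To make this concrete: the generation loop (quoted below) iterates over range(extra_checkpoint_freq - 1, epoch, extra_checkpoint_freq). Plugging in your values as a quick sanity check:

list(range(5 - 1, 1, 5))     # "epoch": [1]  -> []              (no mid-checkpoint epochs, so nothing is generated)
list(range(5 - 1, 20, 5))    # "epoch": [20] -> [4, 9, 14, 19]  (samples would be generated for these epoch_ids)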

If you want to generate data from the last checkpoint instead, you can delete these lines:

for epoch_id in range(self._config["extra_checkpoint_freq"] - 1,
                      self._config["epoch"],
                      self._config["extra_checkpoint_freq"]):
    print("Processing epoch_id: {}".format(epoch_id))
    mid_checkpoint_dir = os.path.join(
        checkpoint_dir, "epoch_id-{}".format(epoch_id))
    if not os.path.exists(mid_checkpoint_dir):
        print("Not found {}".format(mid_checkpoint_dir))
        continue
    save_path = os.path.join(
        self._work_dir,
        "generated_samples",
        "epoch_id-{}".format(epoch_id))
Then dedent the rest of the loop body by 4 spaces, and set mid_checkpoint_dir = checkpoint_dir and save_path = checkpoint_dir.
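
For reference, a minimal sketch of what the edited section might then look like (the trailing comment stands in for the rest of the original loop body, which is assumed to restore the model and write the samples):

mid_checkpoint_dir = checkpoint_dir
save_path = checkpoint_dir
# ...rest of the former loop body, dedented by 4 spaces: it restores the
# model from mid_checkpoint_dir and writes the generated samples to save_path.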

I increased the number of epochs to 20. Now I get a different error while training. Could you please take a look at the log file?

worker.log

It seems like you are running on a Windows system. Could you change

"result_root_folder": "../results/"
and
"result_root_folder": "../results/",
to "result_root_folder": "..\\results\\"and try again?

Hey,
It's working now. Thanks for your help.