netsharecmu/NetShare

ValueError: Variable DoppelGANgerGenerator/attribute_real/layer0/linear/matrix/Adam/ already exists, disallowed.

lurw2000 opened this issue · 5 comments

I have just followed the instructions and run the script driver.py. Here is the error message:

Traceback (most recent call last):
  File "/home/runwei/NetShare/netshare/models/model.py", line 27, in train
    log_folder=log_folder)
  File "/home/runwei/NetShare/netshare/models/doppelganger_tf_model.py", line 176, in _train
    gan.build()
  File "/home/runwei/NetShare/netshare/models/doppelganger_tf/doppelganger.py", line 293, in build
    self.build_loss()
  File "/home/runwei/NetShare/netshare/models/doppelganger_tf/doppelganger.py", line 708, in build_loss
    self.g_loss, var_list=self.generator.trainable_vars
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/optimizer.py", line 413, in minimize
    name=name)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/optimizer.py", line 597, in apply_gradients
    self._create_slots(var_list)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/adam.py", line 131, in _create_slots
    self._zeros_slot(v, "m", self._name)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/optimizer.py", line 1156, in _zeros_slot
    new_slot_variable = slot_creator.create_zeros_slot(var, op_name)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py", line 190, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py", line 164, in create_slot_with_initializer
    dtype)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py", line 74, in _create_slot_var
    validate_shape=validate_shape)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter
    aggregation=aggregation)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 868, in _get_single_variable
    (err_msg, "".join(traceback.format_list(tb))))
ValueError: Variable DoppelGANgerGenerator/attribute_real/layer0/linear/matrix/Adam/ already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/runwei/anaconda3/envs/netshare/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)

Hi, thank you for your interest in NetShare. To reproduce the error on our end, could you please let us know:

  1. Which dataset are you running? We have not fully tested the framework in the main branch, but we have never seen this error before.
  2. Are you running NetShare on a cluster or on a single machine?
  3. Is the Ray package installed and turned on (ray.config.enabled=True)?

I'm running on a single machine (Ubuntu 22.04.1) with Ray turned off.
My driver.py looks like this:

import netshare.ray as ray
from netshare import Generator

if __name__ == '__main__':
    ray.config.enabled = False

    generator = Generator(config="netflow/config_example_netflow_nodp.json")
    generator.train_and_generate(work_folder='../results/netflow/test')

Thanks. We will look into it and get back to you.

Sorry for the delay. It took us some time to pinpoint the issue as we mainly use Ray=ON and a cluster for dev/test.

The problem is that when Ray is OFF on a single machine, everything runs sequentially inside a single process, so multiple TF model instances end up sharing the same default graph, which causes the "variable already exists" error.
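To illustrate the underlying TF1 behavior, here is a standalone sketch (not NetShare code): defining the same variable scope twice in one default graph raises exactly this kind of error, while resetting the graph in between avoids it.

import tensorflow as tf  # TF 1.x API, matching the environment in the traceback

def build_generator_piece():
    # Mimics one variable inside the DoppelGANger generator scope
    with tf.variable_scope("DoppelGANgerGenerator"):
        return tf.get_variable("matrix", shape=[4, 4])

build_generator_piece()
# A second build_generator_piece() call here would raise:
#   ValueError: Variable DoppelGANgerGenerator/matrix already exists, disallowed.
tf.reset_default_graph()   # clear the default graph between sequential runs
build_generator_piece()    # succeeds again in a fresh graph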

The solution is to add the following code snippet at the beginning of the train and generate functions so that the TF graph is reset each time they start:

# If Ray is disabled, reset the TF graph before building a new model
if not ray.config.enabled:
    tf.reset_default_graph()
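For placement, here is a minimal sketch of where the reset sits inside the model's train entry point (the class layout and argument names are illustrative, not the exact NetShare signatures):

import tensorflow as tf   # TF 1.x, as used by the DoppelGANger model code
import netshare.ray as ray

class DoppelGANgerTFModel:
    # Placement sketch only; the real class lives in
    # netshare/models/doppelganger_tf_model.py
    def _train(self, *args, **kwargs):
        # Reset the default graph before building a new GAN so that
        # sequential (Ray=OFF) runs in one process do not collide on
        # already-existing variable names.
        if not ray.config.enabled:
            tf.reset_default_graph()
        # ... build DoppelGANger and train as in the original implementation ...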

We have updated the scripts and README. Please pull the latest codebase, check the README, and let us know if you encounter any further problems.

Side note: running on a single machine with Ray=OFF will take an extremely long time to finish. We would recommend using a cluster if possible (a Ray-enabled driver is sketched below). Alternatively, for quick validation purposes regardless of fidelity, you may follow Tip 1 of Example Usage to set a very small training iteration number to get a sense of running NetShare end-to-end.
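For reference, a Ray-enabled driver looks roughly like the sketch below (adapted from the example driver; the ray.init address and the config/work_folder paths are assumptions that depend on your setup):

import netshare.ray as ray
from netshare import Generator

if __name__ == '__main__':
    # Enable the Ray backend and attach to an existing Ray cluster
    ray.config.enabled = True
    ray.init(address="auto")

    generator = Generator(config="netflow/config_example_netflow_nodp.json")
    # work_folder should not already exist; prefer an absolute path
    # for multi-machine setups
    generator.train_and_generate(work_folder='../results/netflow/test')

    ray.shutdown()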

Closing this issue since there has been no further update. Feel free to create a new one or reopen it if you have any other questions.