NVIDIA/runx

Is it possible to use runx on NGC?

XinDongol opened this issue · 14 comments

Is it possible to use runx on NGC?

ajtao commented

Yes, it is. I have a little bit of support that I'm getting ready, and I should be able to release it shortly.

It is great to know that. Thanks a lot for bringing such a useful tool to the community.

ajtao commented

Hi @XinDongol, I've just pushed NGC support. Maybe you could try it out and provide some feedback.

I am trying to submit jobs to NGC with runx.
I am a bit confused about the .runx file. Where should I put it?

Currently, I put the .runx file in the same dir as the .py file and the sweep.yml file.

.runx:

LOGROOT: /home/xxx/runx
FARM: ngc

ngc:
    NGC_LOGROOT: /myws
    WORKSPACE: xxxx
    SUBMIT_CMD: 'ngc batch run'
    RESOURCES:
       image: nvidian/pytorch:19.10-py3
       gpu: 1
       instance: dgx1v.16g.1.norm
       ace: nv-us-west-2
       result: /result

sweep.yml:

CMD: 'bash /myws/codes/ngc_comm/install.sh;python boostrap.py'

HPARAMS:
  batch_size: [256]
  epoch: [100]
  lr: [0.01]
  subset_pct: [0.1, 0.5]
  di_batch_size: [256]
  num_di_batch: [200]
  logdir: LOGDIR

Then, I call

python -m runx.runx sweep.yml -n

to check commands.

Error message:

Traceback (most recent call last):
  File "/home/xin/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/xin/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/runx.py", line 394, in <module>
    main()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/runx.py", line 387, in main
    run_experiment(args.exp_yml)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/runx.py", line 361, in run_experiment
    experiment = read_config(args.farm, args.exp_yml)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/utils.py", line 96, in read_config
    global_config = read_config_file()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/utils.py", line 86, in read_config_file
    global_config = yaml.load(open(config_fn), Loader=yaml.FullLoader)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/constructor.py", line 41, in get_single_data
    node = self.get_single_node()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 133, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 133, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 127, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
    if self.check_token(KeyToken):
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/scanner.py", line 223, in fetch_more_tokens
    return self.fetch_value()
  File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/scanner.py", line 579, in fetch_value
    self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
  in "./.runx", line 14, column 23

Thanks a lot.

ajtao commented

Currently, I put the .runx file in the same dir as the .py file and the sweep.yml file

Correct, put it in whichever directory you run the runx command from.

Not sure if you have installed the latest runx, but I just tried your example and it works for me:

$ python -m runx.runx sweep.yml -n
ngc batch run --image nvidian/pytorch:19.10-py3 --gpu 1 --instance dgx1v.16g.1.norm --ace nv-us-west-2 --result /result  --name sweep_attentive-mongoose_2020.10.18_13.35 --commandline ' cd /myws/sweep/attentive-mongoose_2020.10.18_13.35/code; PYTHONPATH=/myws/sweep/attentive-mongoose_2020.10.18_13.35/code  bash /myws/codes/ngc_comm/install.sh;python boostrap.py --batch_size 256 --epoch 100 --lr 0.01 --subset_pct 0.1 --di_batch_size 256 --num_di_batch 200 --logdir /myws/sweep/attentive-mongoose_2020.10.18_13.35  ' --workspace xxxx:/myws:RW
ngc batch run --image nvidian/pytorch:19.10-py3 --gpu 1 --instance dgx1v.16g.1.norm --ace nv-us-west-2 --result /result  --name sweep_flying-bumblebee_2020.10.18_13.35 --commandline ' cd /myws/sweep/flying-bumblebee_2020.10.18_13.35/code; PYTHONPATH=/myws/sweep/flying-bumblebee_2020.10.18_13.35/code  bash /myws/codes/ngc_comm/install.sh;python boostrap.py --batch_size 256 --epoch 100 --lr 0.01 --subset_pct 0.5 --di_batch_size 256 --num_di_batch 200 --logdir /myws/sweep/flying-bumblebee_2020.10.18_13.35  ' --workspace xxxx:/myws:RW
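
The two commands above correspond to the two values of subset_pct in sweep.yml. As a rough illustration of that kind of expansion, here is a minimal sketch that assumes a plain Cartesian product over the list-valued HPARAMS. It is not runx's actual code: `expand_hparams` is a hypothetical helper, and the special LOGDIR substitution is left out.

```python
import itertools

def expand_hparams(cmd, hparams):
    """Return one command string per combination of hyperparameter values."""
    keys = list(hparams)
    # Treat scalar values as single-element lists so every key can be swept.
    value_lists = [v if isinstance(v, list) else [v] for v in hparams.values()]
    commands = []
    for combo in itertools.product(*value_lists):
        flags = " ".join(f"--{k} {v}" for k, v in zip(keys, combo))
        commands.append(f"{cmd} {flags}")
    return commands

# Values copied from the sweep.yml in this thread (logdir omitted).
hparams = {
    "batch_size": [256],
    "epoch": [100],
    "lr": [0.01],
    "subset_pct": [0.1, 0.5],
    "di_batch_size": [256],
    "num_di_batch": [200],
}
for line in expand_hparams("python boostrap.py", hparams):
    print(line)
```

With only subset_pct holding more than one value, this yields two command lines, matching the two jobs printed above.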

Maybe you could double-check that there aren't tabs in the YAML on line 14. YAML doesn't allow tabs for indentation, and the parser error they cause can be confusing.

Thanks a lot. It turns out it was caused by tab characters in the YAML. I will test other functions on NGC.
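
For anyone hitting the same ScannerError, a quick way to confirm this kind of problem is to scan the config for tab characters. The snippet below is a small standalone sketch; `find_tabs` is a hypothetical helper, not part of runx, and the demo writes a throwaway file rather than touching a real .runx.

```python
import tempfile
from pathlib import Path

def find_tabs(path):
    """Return (line, column) pairs, both 1-based, for every tab in the file."""
    hits = []
    text = Path(path).read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((lineno, col))
    return hits

# Demo: the third line is indented with a tab instead of spaces.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / ".runx"
    sample.write_text("ngc:\n    RESOURCES:\n\tgpu: 1\n", encoding="utf-8")
    print(find_tabs(sample))  # prints [(3, 1)]
```

Calling `find_tabs(".runx")` in the project directory would list the offending positions, which can then be compared with the line and column in the YAML error.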

One quick question: how should I configure .runx if I just want to run code locally? I tried leaving FARM blank but got an error.

ajtao commented

Don't leave FARM blank. FARM should point to some farm definition. The definition just needs to contain dummy values for RESOURCES and SUBMIT_CMD, and that's about it.

But to run locally, just use -i, for interactive.

This is my current file structure.

project_dir/
	.runx
	sweep.yml
	main.py

In .runx, I only specify LOGROOT to make sure that logs from runx are written to the right dir.

.runx file:

LOGROOT: /home/jovyan/codes/Octopy/di_fl/cifar10

Then, I run

python -m runx.runx sweep.yml -i -n

and got this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 394, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 387, in main
    run_experiment(args.exp_yml)
  File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 361, in run_experiment
    experiment = read_config(args.farm, args.exp_yml)
  File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 101, in read_config
    farm_name = read_config_item(global_config, 'FARM')
  File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 72, in read_config_item
    raise f'can\'t find {key} in config'
TypeError: exceptions must derive from BaseException

I also tried deleting .runx, but still got an error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 394, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 387, in main
    run_experiment(args.exp_yml)
  File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 361, in run_experiment
    experiment = read_config(args.farm, args.exp_yml)
  File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 96, in read_config
    global_config = read_config_file()
  File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 84, in read_config_file
    raise('can\'t find file ./.runx or ~/.config/runx.yml config files')
TypeError: exceptions must derive from BaseException

ajtao commented

Sorry, this should work better. I'll have to clean this up.

For the time being, please try something like this for .runx:

LOGROOT: /home/jovyan/codes/Octopy/di_fl/cifar10

FARM: fake

fake:
    SUBMIT_CMD: na
    RESOURCES:
        dummy: na
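
On the TypeError in the two tracebacks above: it is raised by Python itself, because `raise` was given a plain string rather than an exception, so the intended "can't find ..." message never reaches the user. A minimal sketch of what a cleaned-up lookup could look like is below. The function name mirrors the one in the traceback, but the body and the `ConfigError` class are assumptions for illustration, not the actual runx source.

```python
class ConfigError(Exception):
    """Hypothetical error type for a missing or incomplete runx config."""

def read_config_item(config, key):
    """Return config[key], or raise a readable error if the key is missing."""
    if key not in config:
        # Raise an exception instance; raising a bare string is a TypeError.
        raise ConfigError(f"can't find {key} in config")
    return config[key]

try:
    read_config_item({"LOGROOT": "/home/jovyan/codes/Octopy/di_fl/cifar10"}, "FARM")
except ConfigError as err:
    print(f"config problem: {err}")
```

With a change along these lines, a .runx that lacks FARM would report the missing key directly instead of the unrelated TypeError.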

I am really enjoying runx on NGC and on my local machine these days. It is quite useful for finding good hyperparameters. 💯 👍


I was wondering whether staging of code could be made a user-specified option.
It is nice to stage code as described in readme.md.
However, staging can sometimes take a very long time or get in the way. For example:

  1. There are some big files in the dir.
  2. Uploading to the NGC workspace is slow, depending on the connection.
  3. The main.py that we want to run needs to import something from its parent dir (like from .. import utils).

Currently, the flow is,

  1. upload the code to a specific dir on the NGC workspace
  2. cd into that dir
  3. call python main.py

An example,

ngc batch run --image nvidian/pytorch:20.05-py3-base --instance dgx1v.32g.1.norm 
--ace nv-us-west-2 --result /result --team xx --org nvidian  
--name fixed_non_iid_di_setting_fast-ringtail_2020.10.21_04.08 

--commandline ' cd /myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08/code; 
PYTHONPATH=/myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08/code; 
python non_iid_di.py --logdir /myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08' 

--workspace xxxx:/myws:RW

Without staging, we can cd into the code dir manually by adding commands to CMD in sweep.yml like this,

CMD: ' cd /myws/codes/Octopy/cifar10; python non_iid_di.py'
HPARAMS:
    logdir: LOGDIR
    rounds: [100]
    num_devices: [20]
    device_pct: [1.0]
    non_iid: [1]
    scheduler: ['multistep', 'cosine']

Then, we can get the correct job submission command easily,

ngc batch run --image nvidian/pytorch:20.05-py3-base --instance dgx1v.32g.1.norm 
--ace nv-us-west-2 --result /result --team xx --org nvidian  
--name fixed_non_iid_di_setting_fast-ringtail_2020.10.21_04.08 

--commandline ' cd /myws/codes/Octopy/cifar10; 
python non_iid_di.py --logdir /myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08' 

--workspace xxxx:/myws:RW

ajtao commented

Hi @XinDongol, I'm really happy to hear that you're finding runx useful.

I understand that the upload is taking a while. Can I ask: in your proposal, are you suggesting that all runs will use the same code dir in NGC? In other words, if you have multiple runs, will they all use the same directory in NGC?

The issue with using the same directory in NGC is that it kind of breaks the paradigm of one run per directory. It also makes it challenging to use either tensorboard or sumx to view each run individually.

Do the large files within your code directory need to be staged/uploaded? I wonder if (a) they could be put in a central place so they don't need to be copied, or (b) if they really aren't needed, you could use CODE_IGNORE_PATTERNS in .runx to exclude them from being copied.
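
To illustrate option (b), here is a minimal sketch of pattern-based exclusion during a copy, using the standard library's `shutil.copytree` with `shutil.ignore_patterns`. It assumes the patterns are shell-style globs; whether runx implements CODE_IGNORE_PATTERNS this way is an assumption, and `stage_code` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def stage_code(src, dst, ignore_patterns):
    """Copy src to dst, skipping files and directories matching any glob pattern."""
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*ignore_patterns))

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "mycode"
    src.mkdir()
    (src / "main.py").write_text("print('hello')\n")
    (src / "weights.bin").write_bytes(b"\0" * 1024)  # stand-in for a large file

    dst = Path(tmp) / "staged"
    stage_code(src, dst, ["*.bin", ".git"])
    print(sorted(p.name for p in dst.iterdir()))  # prints ['main.py']
```

The large file never leaves the client in this setup, so each staged copy stays small.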

Different runs will still use their own dir because they have different --logdir.

Suppose my code is in /client/mycode/ on the client.

In the current flow:

  1. copy/upload the code from /client/mycode to /ngc_ws/logdir/code/ on the NGC workspace
  2. cd /ngc_ws/logdir/code/
  3. run the code from /ngc_ws/logdir/code/
  4. write tensorboard/csv logs to logdir

If the connection to NGC is slow (as it was yesterday, for example), uploading n times (n is the number of runs) takes a lot of time.


In my proposal, if the code is already in some dir on the NGC workspace, say /ngc_ws/mycode, steps 1, 2, and 3 can be replaced by cd /ngc_ws/mycode and running from there. So we can skip copying n times. Different runs will still use their own dir because they have different --logdir.

The only difference is that the uploading solution uses a different PYTHONPATH='/ngc_ws/logdir/code' for each run, whereas the no-upload solution uses the same PYTHONPATH='/ngc_ws/mycode' for all runs.
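
To make the difference concrete, here is a rough sketch of how the job's command line could be assembled in both modes. `build_commandline`, `stage_code`, and `shared_code_dir` are hypothetical names for illustration, not existing runx options, and the paths are the example paths from above.

```python
def build_commandline(cmd, logdir, stage_code=True, shared_code_dir=None):
    """Compose the shell string passed to the job, with or without staging."""
    if stage_code:
        # Each run gets its own copy of the code under its log directory.
        code_dir = f"{logdir}/code"
    else:
        # All runs share one directory that was uploaded ahead of time.
        if shared_code_dir is None:
            raise ValueError("shared_code_dir is required when stage_code is False")
        code_dir = shared_code_dir
    return f"cd {code_dir}; PYTHONPATH={code_dir} {cmd} --logdir {logdir}"

# Staged: code dir and PYTHONPATH follow the per-run logdir.
print(build_commandline("python main.py", "/ngc_ws/logdir"))
# Not staged: code dir and PYTHONPATH are fixed, only --logdir differs per run.
print(build_commandline("python main.py", "/ngc_ws/logdir",
                        stage_code=False, shared_code_dir="/ngc_ws/mycode"))
```

In both cases --logdir stays per-run, so logs for each run still land in separate directories.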


Personally, I like the idea of staging. Making staging optional would give users more flexibility to cope with a bad NGC connection.

ajtao commented

I see, yes, it makes sense.

One reason we create, and then upload, a separate copy of the code per run is that sometimes you will want to change code locally and then, using the same experiment.yaml file, add some new runs. Now your experiment directory will contain multiple runs, some with older code and some with newer code. Oftentimes this goes through many iterations. So each run directory documents the state of the code when you ran the experiment, and you never have to worry about reproducibility.

I think there's a way to make what you're proposing work, however. Each time you run runx, we could create a single code directory instead of one per run and, as you say, run all the runs out of that one directory. The next time you run runx, however, we wouldn't know whether you had changed the code, so we would upload a new copy.

So maybe there are two things to prototype: (1) optional staging, as you propose, and (2) still staging, but limited to one copy per runx invocation.