Is it possible to use runx on NGC?
XinDongol opened this issue · 14 comments
Yes, it is. I have a bit of support that I'm getting ready; I should be able to release it shortly.
That is great to know. Thanks a lot for bringing such a useful tool to the community.
Hi @XinDongol, I've just pushed NGC support. Maybe you could try it out and provide some feedback.
I am trying to submit jobs to NGC with runx. I am kind of confused about the .runx file. Where should I put it? Currently, I put the .runx file in the same dir as the .py file and the sweep.yml file.
.runx:
LOGROOT: /home/xxx/runx
FARM: ngc

ngc:
    NGC_LOGROOT: /myws
    WORKSPACE: xxxx
    SUBMIT_CMD: 'ngc batch run'
    RESOURCES:
        image: nvidian/pytorch:19.10-py3
        gpu: 1
        instance: dgx1v.16g.1.norm
        ace: nv-us-west-2
        result: /result
sweep.yml:
CMD: 'bash /myws/codes/ngc_comm/install.sh;python boostrap.py'

HPARAMS:
    batch_size: [256]
    epoch: [100]
    lr: [0.01]
    subset_pct: [0.1, 0.5]
    di_batch_size: [256]
    num_di_batch: [200]
    logdir: LOGDIR
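(As far as I can tell, the literal LOGDIR value is a placeholder that runx substitutes with each run's generated log directory when it builds the command.)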
Then, I call

python -m runx.runx sweep.yml -n

to check the commands (-n is the dry-run flag: it prints the commands without submitting them).
Error message:
Traceback (most recent call last):
File "/home/xin/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/xin/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/runx.py", line 394, in <module>
main()
File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/runx.py", line 387, in main
run_experiment(args.exp_yml)
File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/runx.py", line 361, in run_experiment
experiment = read_config(args.farm, args.exp_yml)
File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/utils.py", line 96, in read_config
global_config = read_config_file()
File "/home/xin/anaconda3/lib/python3.7/site-packages/runx/utils.py", line 86, in read_config_file
global_config = yaml.load(open(config_fn), Loader=yaml.FullLoader)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/__init__.py", line 114, in load
return loader.get_single_data()
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/constructor.py", line 41, in get_single_data
node = self.get_single_node()
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/scanner.py", line 223, in fetch_more_tokens
return self.fetch_value()
File "/home/xin/anaconda3/lib/python3.7/site-packages/yaml/scanner.py", line 579, in fetch_value
self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
in "./.runx", line 14, column 23
Thanks a lot.
> Currently, I put the .runx file in the same dir as the .py file and the sweep.yml file
Correct, you put it in whichever directory you run the runx command from.
Not sure if you have installed the latest runx, but I just tried your example and it works for me:
$ python -m runx.runx sweep.yml -n
ngc batch run --image nvidian/pytorch:19.10-py3 --gpu 1 --instance dgx1v.16g.1.norm --ace nv-us-west-2 --result /result --name sweep_attentive-mongoose_2020.10.18_13.35 --commandline ' cd /myws/sweep/attentive-mongoose_2020.10.18_13.35/code; PYTHONPATH=/myws/sweep/attentive-mongoose_2020.10.18_13.35/code bash /myws/codes/ngc_comm/install.sh;python boostrap.py --batch_size 256 --epoch 100 --lr 0.01 --subset_pct 0.1 --di_batch_size 256 --num_di_batch 200 --logdir /myws/sweep/attentive-mongoose_2020.10.18_13.35 ' --workspace xxxx:/myws:RW
ngc batch run --image nvidian/pytorch:19.10-py3 --gpu 1 --instance dgx1v.16g.1.norm --ace nv-us-west-2 --result /result --name sweep_flying-bumblebee_2020.10.18_13.35 --commandline ' cd /myws/sweep/flying-bumblebee_2020.10.18_13.35/code; PYTHONPATH=/myws/sweep/flying-bumblebee_2020.10.18_13.35/code bash /myws/codes/ngc_comm/install.sh;python boostrap.py --batch_size 256 --epoch 100 --lr 0.01 --subset_pct 0.5 --di_batch_size 256 --num_di_batch 200 --logdir /myws/sweep/flying-bumblebee_2020.10.18_13.35 ' --workspace xxxx:/myws:RW
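(You can see how the pieces of .runx map into the command: the RESOURCES entries become ngc batch run flags, and WORKSPACE plus NGC_LOGROOT become the --workspace xxxx:/myws:RW mount.)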
Maybe you could double-check that there aren't tabs in the YAML on line 14, because the YAML reader can get confused by them.
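If it helps, here is a quick way to hunt for stray tabs (a minimal sketch, not part of runx):

# YAML forbids tabs in indentation, so flag any line of .runx containing one.
with open('.runx') as f:
    for lineno, line in enumerate(f, start=1):
        if '\t' in line:
            print(f'tab on line {lineno}: {line.rstrip()!r}')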
Thanks a lot. It turns out it was caused by tab characters in the YAML. I will test the other functions on NGC.
One quick question: how should I configure .runx if I just want to run code locally? I tried leaving FARM blank but got an error.
Don't leave FARM blank. FARM should point to some farm definition. The definition just needs to contain dummy values for RESOURCES and SUBMIT_CMD, but that's about it.
But to run locally, just use -i, for interactive.
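For example:

python -m runx.runx sweep.yml -i

should run each job of the sweep directly on the local machine rather than submitting it to a farm.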
This is my current file structure:

project_dir/
    .runx
    sweep.yml
    main.py
In the .runx, I only specify LOGROOT to make sure that logs from runx are written to the right dir.
.runx file:

LOGROOT: /home/jovyan/codes/Octopy/di_fl/cifar10
Then, I run

python -m runx.runx sweep.yml -i -n

and get this error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 394, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 387, in main
run_experiment(args.exp_yml)
File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 361, in run_experiment
experiment = read_config(args.farm, args.exp_yml)
File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 101, in read_config
farm_name = read_config_item(global_config, 'FARM')
File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 72, in read_config_item
raise f'can\'t find {key} in config'
TypeError: exceptions must derive from BaseException
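(Side note on the confusing TypeError: utils.py raises a plain f-string, and in Python 3 you can only raise instances of BaseException subclasses, so the real message gets swallowed. Presumably the intent was something like

raise ValueError(f"can't find {key} in config")

which would have surfaced the actual problem: FARM is missing from the config.)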
I also tried deleting the .runx, but still got an error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 394, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 387, in main
run_experiment(args.exp_yml)
File "/opt/conda/lib/python3.6/site-packages/runx/runx.py", line 361, in run_experiment
experiment = read_config(args.farm, args.exp_yml)
File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 96, in read_config
global_config = read_config_file()
File "/opt/conda/lib/python3.6/site-packages/runx/utils.py", line 84, in read_config_file
raise('can\'t find file ./.runx or ~/.config/runx.yml config files')
TypeError: exceptions must derive from BaseException
Sorry, this should work better. I'll have to clean this up.
For the time being, please try something like this for .runx:
LOGROOT: /home/jovyan/codes/Octopy/di_fl/cifar10
FARM: fake

fake:
    SUBMIT_CMD: na
    RESOURCES:
        dummy: na
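With that in place, python -m runx.runx sweep.yml -i should run the jobs locally; the fake farm exists only to keep the config reader happy and is never used to submit anything.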
I am really enjoying runx on NGC and my local machine these days. It is quite useful for finding good hyperparameters. 💯 👍
I was wondering whether it is possible to make staging of code a user-specified option. Basically, it is nice to stage code as mentioned in readme.md.
However, staging can sometimes take forever or get in the way. For example:
- there are some big files in the dir;
- uploading to the NGC workspace is slow, depending on the connection;
- the main.py that we want to run needs to import something from its parent dir (like import ..utils).
Currently, the flow is:
- upload the code to a specific dir on the NGC workspace
- cd into this specific dir
- call python main.py
An example:
ngc batch run --image nvidian/pytorch:20.05-py3-base --instance dgx1v.32g.1.norm
--ace nv-us-west-2 --result /result --team xx --org nvidian
--name fixed_non_iid_di_setting_fast-ringtail_2020.10.21_04.08
--commandline ' cd /myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08/code;
PYTHONPATH=/myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08/code;
python non_iid_di.py --logdir /myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08'
--workspace xxxx:/myws:RW
Without staging, we can cd into the code dir manually by adding commands to CMD in sweep.yml, like this:
CMD: ' cd /myws/codes/Octopy/cifar10; python non_iid_di.py'
HPARAMS:
    logdir: LOGDIR
    rounds: [100]
    num_devices: [20]
    device_pct: [1.0]
    non_iid: [1]
    scheduler: ['multistep', 'cosine']
Then, we get the correct job submission command easily:
ngc batch run --image nvidian/pytorch:20.05-py3-base --instance dgx1v.32g.1.norm
--ace nv-us-west-2 --result /result --team xx --org nvidian
--name fixed_non_iid_di_setting_fast-ringtail_2020.10.21_04.08
--commandline ' cd /myws/codes/Octopy/cifar10;
python non_iid_di.py --logdir /myws/fixed_non_iid_di_setting/fast-ringtail_2020.10.21_04.08'
--workspace xxxx:/myws:RW
Hi @XinDongol, I'm really happy to hear that you're finding runx useful.
I understand that the upload is taking a while. Can I ask: in your proposal, are you suggesting that all runs would use the same code dir in NGC? In other words, if you have multiple runs, would they all use the same directory in NGC?
The issue with using the same directory in NGC is that it kind of breaks the paradigm of one run per directory. It also makes it challenging to use either tensorboard or sumx to view each run individually.
Do the large files within your code directory need to be staged/uploaded? I wonder if (a) they could be put into a central place so they don't need to be copied, or (b) if they really aren't needed, could you use CODE_IGNORE_PATTERNS in .runx to exclude them from being copied?
Different runs will still use their own dir because they have a different --logdir.
Suppose my code is in /client/mycode/ on the client.
In the current flow:
1. copy/upload the code from /client/mycode to /ngc_ws/logdir/code/ on the NGC workspace
2. cd into /ngc_ws/logdir/code/
3. run the code from /ngc_ws/logdir/code/
4. write tensorboard/csv logs to logdir
If the connection to NGC is slow (as it was yesterday, for example), uploading n times (where n is the number of runs) takes a lot of time.
In my proposal, if the code is already at some dir like /ngc_ws/mycode on the NGC workspace, steps 1-3 can be replaced by cd /ngc_ws/mycode, so we can skip copying n times. Different runs will still use their own dir because they have a different --logdir.
The only difference is that the uploading solution uses a different PYTHONPATH='/ngc_ws/logdir/code' for each run, while the no-uploading solution uses the same PYTHONPATH='/ngc_ws/mycode' for all runs.
Personally, I like the idea of staging. Making staging an option would give users more flexibility to handle a bad NGC connection.
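To make the ask concrete: this could be as small as a hypothetical per-farm flag in .runx, say STAGE_CODE: false (a made-up name, not an existing runx option), telling runx to skip the per-run upload and emit cd /ngc_ws/mycode with a fixed PYTHONPATH instead.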
I see, yes, it makes sense.
One reason why we create, and then upload, a separate copy of the code per run is that sometimes you will want to change the code locally and then, using the same experiment yaml file, add some new runs. Now your experiment directory will contain multiple runs, some with older code and some with newer code. Oftentimes this can go through many iterations. So each run directory is a kind of documentation of the state of the code when you ran the experiment, and you never have to worry about reproducibility.
I think there's a way to make what you're proposing work, however. Every time you run runx, we can create a single code directory, not one per run. And as you say, we'd run all the runs out of that one directory. The next time you run runx, though, since we don't know whether you changed the code, we would still want to upload a fresh copy.
So maybe there are these two things to prototype: (1) optional staging, as you propose; (2) still staging, but limited to one code copy per runx invocation.
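A rough sketch of what (2) might look like, with made-up helper names rather than actual runx internals:

import shutil
import time

def stage_once(src_dir, logroot, exp_name):
    """Hypothetical: copy the code tree exactly once per runx invocation.
    Every run of the sweep executes out of this shared copy, while each
    run still gets its own logdir."""
    stamp = time.strftime('%Y.%m.%d_%H.%M')
    code_dir = f'{logroot}/{exp_name}/code_{stamp}'
    shutil.copytree(src_dir, code_dir)  # one copy/upload instead of n
    return code_dir

The trade-off versus (1) is that you keep a reproducibility snapshot per invocation, at the cost of one upload each time you call runx.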