Using a Tunnel to Download on a Non-Internet Compute Node
lsawade opened this issue · 11 comments
Internet for Non-Internet Compute Nodes
As mentioned in our meeting, I would like to add download functionality to one of my tasks. The reason for this is that the workflow is going to be a routine HPC job in the long term.
In the normal Pipeline/Stage/Task setting on, e.g., Summit, this looks like the following:
Code sample
import os
from radical.entk import Pipeline, Stage, Task, AppManager

# RabbitMQ connection details for EnTK (here assumed to be exported in the environment)
hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port = int(os.environ.get('RMQ_PORT', '5672'))
username = os.environ.get('RMQ_USERNAME')
password = os.environ.get('RMQ_PASSWORD')

# CIN_DB and PARAMETER_PATH are paths defined elsewhere in the workflow

pipe = Pipeline()

# Create a Stage object
download = Stage()

# Create a Task object
t = Task()
t.name = 'Download'
t.pre_exec = ['module purge', 'module load anaconda3', 'conda activate gcmt3d']
t.executable = 'request-data'
t.arguments = ['-f', CIN_DB, '-p', PARAMETER_PATH]
t.download_output_data = ['STDOUT', 'STDERR']
# This is the only way I can assign a single-CPU task right now on Traverse
# (Hyungro is aware and on it)
t.cpu_reqs = {'processes': 1, 'process_type': 'MPI', 'threads_per_process': 1, 'thread_type': 'OpenMP'}

download.add_tasks(t)

# Add Stage to the Pipeline
pipe.add_stages(download)

res_dict = {
    'resource': 'princeton.traverse',
    'project_id': 'test',
    'schema': 'local',
    'walltime': 20,
    'cpus': 32 * 1,
    'gpus': 4 * 1,  # Unnecessary, but there aren't any non-GPU nodes.
}

appman = AppManager(hostname=hostname, port=port,
                    username=username, password=password,
                    resubmit_failed=False)
appman.resource_desc = res_dict
appman.workflow = set([pipe])
appman.run()
Now, this task of course fails because Traverse compute nodes do not have internet access. As discussed, EnTK has an undocumented way of tunneling the connection which I would like to check out.
How would request-data work, i.e., with what data service would it need to communicate?
It's basically a wrapper around this function: https://docs.obspy.org/packages/autogen/obspy.clients.fdsn.mass_downloader.html
Which in turn is a wrapper around a URL query that downloads seismic data to a specific location on disk from here: https://www.fdsn.org/webservices/
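For context, a minimal sketch of what such a mass-downloader call looks like (the event location, time window, network selection, and storage paths below are illustrative, not the actual request-data defaults):

from obspy import UTCDateTime
from obspy.clients.fdsn.mass_downloader import (CircularDomain, Restrictions,
                                                MassDownloader)

# Download data within 90 degrees of an (illustrative) event location
domain = CircularDomain(latitude=-21.3, longitude=-68.9,
                        minradius=0.0, maxradius=90.0)

# Time window and station/channel selection around the event
restrictions = Restrictions(starttime=UTCDateTime(2006, 5, 6, 18, 0, 0),
                            endtime=UTCDateTime(2006, 5, 6, 20, 0, 0),
                            network='IU', channel_priorities=['BH?'],
                            reject_channels_with_gaps=False)

# Query the FDSN web services and write the data to disk
mdl = MassDownloader(providers=['IRIS'])
mdl.download(domain, restrictions,
             mseed_storage='waveforms', stationxml_storage='stations')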
@andre-merzky, could you elaborate on what this means, or on what you expect/want the process to look like?
I am actually not sure what that implies... But, a request / proposal: can you try to get that function to work over an arbitrary ssh tunnel? If you manage to do that, we can provide that tunnel to do the same thing from the compute nodes (the tunnel endpoint would be hosted on the batch node). Would that work for you?
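For concreteness, an "arbitrary ssh tunnel" could be as simple as a local port forward through the login node; a rough sketch (placeholder hosts and ports, not the tunnel EnTK would provide) that checks whether an FDSN server is reachable through such a forward:

import socket
import subprocess
import time

LOGIN_NODE = 'traverse.princeton.edu'            # assumed tunnel host
LOCAL_PORT = 18080                               # arbitrary free local port
REMOTE_HOST, REMOTE_PORT = 'service.iris.edu', 80

# -N: no remote command, -L: forward localhost:LOCAL_PORT -> REMOTE_HOST:REMOTE_PORT
tunnel = subprocess.Popen(['ssh', '-N', '-L',
                           f'{LOCAL_PORT}:{REMOTE_HOST}:{REMOTE_PORT}',
                           LOGIN_NODE])
time.sleep(5)  # crude wait for the tunnel to come up

try:
    with socket.create_connection(('localhost', LOCAL_PORT), timeout=10):
        print('FDSN endpoint reachable through the tunnel')
finally:
    tunnel.terminate()

The open part would be pointing the downloader's HTTP requests at such a tunnel.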
How would that look? Do you mean something like this:
ssh lsawade@traverse8.princeton.edu bash -l '/home/lsawade/gcmt3d/workflow/entk/ssh-request-data.sh /tigress/lsawade/source_inversion_II/events/CMT.perturb.440/C200605061826B /home/lsawade/gcmt3d/workflow/params'
where /home/lsawade/gcmt3d/workflow/entk/ssh-request-data.sh looks like:
#!/bin/bash
module load anaconda3
conda activate gcmt3d_wenjie
request-data -f $1 -p $2
This works fine to download data on the login node when executed from my home computer.
@lsawade: Hi Lucas, can you please copy the following resource config into $HOME/.radical/pilot/configs/resource_princeton.json and define the IP address (or a host name) of the destination where you want to copy data in the env var RP_APP_TUNNEL_ADDR below:
{
    "traverse": {
        "schemas"                : ["local"],
        "mandatory_args"         : [],
        "local"                  :
        {
            "job_manager_endpoint" : "slurm://traverse.princeton.edu/",
            "job_manager_hop"      : "fork://localhost/",
            "filesystem_endpoint"  : "file://localhost/"
        },
        "default_queue"          : "test",
        "resource_manager"       : "SLURM",
        "cores_per_node"         : 32,
        "gpus_per_node"          : 4,
        "agent_scheduler"        : "CONTINUOUS",
        "agent_spawner"          : "POPEN",
        "agent_launch_method"    : "SSH",
        "task_launch_method"     : "SRUN",
        "mpi_launch_method"      : "MPIRUN",
        "pre_bootstrap_0"        : ["module load anaconda3",
                                    "module load openmpi/gcc",
                                    "export RP_APP_TUNNEL_ADDR=<dest_addr>"
                                   ],
        "valid_roots"            : [],
        "rp_version"             : "local",
        "python_dist"            : "default",
        "default_remote_workdir" : "/scratch/gpfs/$USER/",
        "virtenv"                : "/scratch/gpfs/$USER/ve.rp",
        "virtenv_mode"           : "use",
        "lfs_path_per_node"      : "/tmp",
        "lfs_size_per_node"      : 0,
        "mem_per_node"           : 0,
        "forward_tunnel_endpoint": "traverse.princeton.edu"
    }
}
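As a quick sanity check that the file is valid JSON and sits at the expected path, something like this could be run (a sketch; the path follows the instruction above):

import json
import os

cfg_path = os.path.expanduser('~/.radical/pilot/configs/resource_princeton.json')
with open(cfg_path) as f:
    cfg = json.load(f)  # raises json.JSONDecodeError if the file is malformed

# Print a couple of entries to confirm the structure is as expected
print(cfg['traverse']['local']['job_manager_endpoint'])
print(cfg['traverse']['pre_bootstrap_0'])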
P.S. If you use virtenv from a different location, then you can update the corresponding parameter.
Hi @mtitov, thanks for this. It kind of fell off my radar with AGU, the holidays, etc.
I'm not really clear on the usage within the workflow now. Would it be something like this:
import os
from radical.entk import Task

tunnel_addr = os.environ.get('RP_APP_TUNNEL_ADDR', None)

t = Task()
t.executable = '/bin/ssh'  # Assign executable to the task
# The three adjacent string literals concatenate into a single argument
t.arguments = [tunnel_addr, 'bash', '-l',
               '/home/lsawade/gcmt3d/workflow/entk/ssh-request-data.sh '
               '/tigress/lsawade/source_inversion_II/events/CMTS/C200605061826B '
               '/home/lsawade/gcmt3d/workflow/params']
Hi @lee212 @andre-merzky,
I'm linking #130 here because I feel I'm misunderstanding the way I am supposed to set up/use the tunnel. Comments?
Hi everyone,
After figuring out that this was all a misunderstanding, and with some guidance from Andre, the workflow is now able to download data. I'll describe the setup below, but first let me answer the main question.
Why the heck am I not pre-downloading the data?
While the main reason for the workflow is to recompute an improved database of thousands of earthquakes (the computational part for which I need the workflow), the workflow is also supposed to be future-proof, meaning that whenever there is an earthquake, I just need to trigger the workflow and let it run. This includes the data download.
As a computational cost, the data download is peanuts compared to the forward modeling, processing, and subsequent inversion. Therefore, I can easily justify putting the download into the workflow.
How to download
The actual download script just needs internet access, and it will download the data onto the same filesystem! This is very important and where our misunderstanding happened. The only problem was that I could not download as part of the workflow because the compute nodes do not have internet (this is not the case on Summit). A "diversion" of the download task to the login node did the trick.
This means, I simply trigger the execution of the script by logging into the login node from within the workflow.
Necessary steps that have to be taken, as shown by @andre-merzky in Slack correspondence
You would need to make sure that, on the login node, the following works without a password prompt:
ssh localhost hostname
This can be done by
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 0600 $HOME/.ssh/authorized_keys
The key names may differ, depending on system defaults...
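A way to verify this non-interactively from a script (a sketch; BatchMode makes ssh fail instead of prompting for a password):

import subprocess

# Exit code 0 only if key-based login to localhost works without a prompt
result = subprocess.run(['ssh', '-o', 'BatchMode=yes', 'localhost', 'hostname'],
                        capture_output=True, text=True)
print('passwordless ssh OK' if result.returncode == 0
      else 'ssh still needs a password / key setup')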
Then we can create a task that is executed in the following way:
from radical.entk import Task

# Create a Task object
t = Task()
t.name = 'download-task'
t.executable = '/usr/bin/ssh'
# ssh into the login node (which has internet) and run the download script there
t.arguments = ["<username>@<hostname>", "'bash -l /path/to/download_script.sh <input1> <input2>'"]
t.download_output_data = ['STDOUT', 'STDERR']
t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None, 'cpu_threads': 1, 'cpu_thread_type': None}
The bash -l is necessary because a non-login shell is unaware of commands such as module load; loading a login bash shell is a workaround. If there is a better way, tips are welcome.
In the end, the download_script.sh was needed to load anaconda etc., which is why module load was necessary in my case. Here is its content:
#!/bin/bash
module load anaconda3
conda activate <your_env>
request-data -f $1 -p $2
where request-data is a binary created by my Python package.
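For completeness, a sketch of how this download task could be slotted in as the first stage of the pipeline from the opening post (names as above; not a verbatim copy of my production setup):

from radical.entk import Pipeline, Stage

pipe = Pipeline()

# Stage 1: the ssh-based download task 't' constructed above
download = Stage()
download.add_tasks(t)
pipe.add_stages(download)

# Subsequent stages (forward modeling, processing, inversion) are added here
# in the same way before handing the pipeline to the AppManager.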
Thanks everyone for the help! This was the last lego brick!
Cheers!