radical-collaboration/hpc-workflows

Using a Tunnel to Download on a Non-Internet Compute Node

lsawade opened this issue · 11 comments

Internet for Non-Internet Compute Nodes

As mentioned in our meeting, I would like to add download functionality to one of my tasks. The reason for this is that the workflow is going to be a routine HPC job in the long-term.

In the normal Pipeline/Stage/Task setting on, e.g. Summit, this looks like the following:

Code sample

pipe = Pipeline()

# Create a Stage object
download = Stage()

# Create a Task object
t = Task()
t.name = 'Download'
t.pre_exec = ['module purge', 'module load anaconda3', 'conda activate gcmt3d']
t.executable = 'request-data'
t.arguments = ['-f', CIN_DB, '-p', PARAMETER_PATH]
t.download_output_data = ['STDOUT', 'STDERR']
# This is the only way I can assign singleton'se cpu task rn on Traverse (Hyungro is aware and on it)
t.cpu_reqs = {'processes': 1, 'process_type': 'MPI',  'threads_per_process': 1, 'thread_type': 'OpenMP'}
download.add_tasks(t)

# Add Stage to the Pipeline
pipe.add_stages(download)

res_dict = {
    'resource': 'princeton.traverse',
    'project_id': 'test',
    'schema': 'local',
    'walltime': 20,
    'cpus': 32 * 1,
    'gpus': 4 * 1,  # Unnecessary, but there aren't any non-gpu nodes.
}

appman = AppManager(hostname=hostname, port=port,
                    username=username, password=password, 
                    resubmit_failed=False)
appman.resource_desc = res_dict
appman.workflow = set([pipe])
appman.run()

Now, this task of course fails because Traverse nodes do not have internet. As discussed, EnTK has an undocumented way of tunneling the connection which I would like to checkout.

How would request_data work, i.e., with what data service would it need to communicate?

It's basically a wrapper around this function: https://docs.obspy.org/packages/autogen/obspy.clients.fdsn.mass_downloader.html

Which in turn is a wrapper around a url query that downloads seismic data to a specific location to the disk from here https://www.fdsn.org/webservices/

@andre-merzky, could you elaborate what this means, or requirements to you expect/want the process to look like?

i am actually not sure what that implies... But, a request / proposal: can you try to get that function to work over an arbitrary ssh tunnel? If you manage to do that, we can provide that tunnel to do the same thing from the compute nodes (the tunnel endpoint would be hosted on the batch node). Would that work for you?

How would that look? Do you mean something like this:

ssh lsawade@traverse8.princeton.edu bash -l '/home/lsawade/gcmt3d/workflow/entk/ssh-request-data.sh /tigress/lsawade/source_inversion_II/events/CMT.perturb.440/C200605061826B /home/lsawade/gcmt3d/workflow/params'

where /home/lsawade/gcmt3d/workflow/entk/ssh-request-data.sh looks like:

#!/bin/bash
module load anaconda3
conda activate gcmt3d_wenjie
request-data -f $1 -p $2

This works fine to download data on the login node executed from my home computer.

@mtitov : Mikhail, can you help Lucas (@lsawade ) to use our tunnel for that data transfer request? If that works, we can also change the resource config to pull up an additional application tunnel.

@lsawade: Hi Lucas, can you please copy the following resource config into $HOME/.radical/pilot/configs/resource_princeton.json and define IP address of the destination (or a host name) where you want to copy data in this env var RP_APP_TUNNEL_ADDR below:

{
    "traverse": {
        "schemas"                     : ["local"],
        "mandatory_args"              : [],
        "local"                       :
        {
            "job_manager_endpoint"    : "slurm://traverse.princeton.edu/",
            "job_manager_hop"         : "fork://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },
        "default_queue"               : "test",
        "resource_manager"            : "SLURM",
        "cores_per_node"              : 32,
        "gpus_per_node"               : 4,
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "agent_launch_method"         : "SSH",
        "task_launch_method"          : "SRUN",
        "mpi_launch_method"           : "MPIRUN",
        "pre_bootstrap_0"             : ["module load anaconda3",
                                         "module load openmpi/gcc",
                                         "export RP_APP_TUNNEL_ADDR=<dest_addr>"
                                        ],
        "valid_roots"                 : [],
        "rp_version"                  : "local",
        "python_dist"                 : "default",
        "default_remote_workdir"      : "/scratch/gpfs/$USER/",
        "virtenv"                     : "/scratch/gpfs/$USER/ve.rp",
        "virtenv_mode"                : "use",
        "lfs_path_per_node"           : "/tmp",
        "lfs_size_per_node"           : 0,
        "mem_per_node"                : 0,
        "forward_tunnel_endpoint"     : "traverse.princeton.edu"
    }
}

p.s. if you use virtenv from a different location, then you can update the corresponding parameter

I will test this as soon as the cluster is back up! Thanks @mtitov

Hi @mtitov Thanks for this. It kind of fell of my radar there being AGU, holidays etc.
I'm not really clear on the usage within the workflow now. Would it be something like this:

tunnel_addr = os.environ.get('RP_APP_TUNNEL_ADDR', None)

t = Task()
t.executable = '/bin/ssh' # Assign executable to the task
t.arguments = [tunnel_addr, 'bash', '-l', 
               '/home/lsawade/gcmt3d/workflow/entk/ssh-request-data.sh '
               '/tigress/lsawade/source_inversion_II/events/CMTS/C200605061826B ' 
               '/home/lsawade/gcmt3d/workflow/params' ]  

Hi @lee212 @andre-merzky,

I'm linking #130 here because I feel I'm misunderstanding the way I am supposed to setup/use the tunnel. Comments?

Hi everyone,

After figuring out that this was all a misunderstanding and some guidance from Andre the workflow is able to download data. I'll describe the setup below. But first let me answer the main question.


Why the heck am I not pre-downloading the data?

While the main reason for the workflow is to recompute an improved database of thousands of earthquakes (the computational part for which I need the workflow), the workflow is also supposed to be future proof. Meaning whenever there is an earthquake I need to just trigger the workflow and let it run. This includes the data download.

When considering the data download as an obstacle for computation, it is peanuts compared to the forward modeling, processing, and subsequent inversion. Therefore, I can easily justify putting the download into the workflow.


How to download

The actual download script just needs internet and it will download the data onto the same filesystem! This is very important and where our misunderstanding happened. The problem was only, that I could not download as part of the workflow because the compute nodes do not have internet (this is not the case on Summit). A "diversion" of the download task to the login node did the trick.

This means, I simply trigger the execution of the script by logging into the login node from within the workflow.

Necessary steps that have to be taken as shown by @andre-merzky in slack correspondence

You would need to make sure that, on the login node, the following works without a password prompt:

ssh localhost hostname

This can be done by

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 0600 $HOME/.ssh/authorized_keys

The key names may differ, depending on system defaults...

Then we can create a task that is executed the following way

# Create a Task object
t = Task()
t.name = 'download-task'
t.executable = '/usr/bin/ssh'
t.arguments = ["<username>@<hostname>", "'bash -l /path/to/download_script.sh <input1> <input2>'"]
t.download_output_data = ['STDOUT', 'STDERR']
t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None, 'cpu_threads': 1, 'cpu_thread_type': None}

The bash -l is necessary because the script is unaware of commands such as module load etc. loading a bash shell is a workaround. If there is a better way, tips are welcome.

The download_script.sh in the end was needed to load anaconda etc. so that module load was necessary in my case.

Here the content:

#!/bin/bash
module load anaconda3
conda activate <your_env>
request-data -f $1 -p $2

where request-data is a binary created by my python package.

Thanks everyone for the help! This was the last lego brick!

Cheers!