miniwdl-ext/miniwdl-slurm

Handling NFS latency

Closed this issue · 8 comments

Has anyone else encountered any errors that could be explained by NFS latency?

I'm using miniwdl-slurm to run tasks in singularity containers, and I have one task that uses parallel. I'm seeing stochastic error messages like this:

parallel: Error: Tmpdir '/tmp' is not writable.
parallel: Error: Try 'chmod +w /tmp'

Given that /tmp is mounted to a working directory created by miniwdl with plenty of available space, I think it's unlikely to be a permissions error. I've seen errors like this before with snakemake, where files or directories created on one cluster node can't yet be seen from another node. Typically for snakemake, the files are created on a compute node and the latency is on the node that spawns new jobs, and you can handle it by adding a --latency-wait value.
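
For context, the snakemake workaround is just a flag on the invocation; the 60 seconds below is an arbitrary example value, not a recommendation:

# tell snakemake to keep polling for up to 60 seconds before declaring an output file missing
snakemake --latency-wait 60 ...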

Is it possible that I'm seeing the opposite here, where a file created on the spawning node can't yet be seen from a compute node? Have you considered incorporating an option to allow for NFS latency?

ping @DavyCats
@williamrowell Can you provide the workflow or task? Does this only occur with tasks using parallel?

mlin commented

The SingularityContainer base class inside miniwdl, which miniwdl-slurm inherits from, sets up /tmp in a slightly unusual way due to some friction points with the default local singularity configuration. See:

https://github.com/chanzuckerberg/miniwdl/blob/1bc3776e65d3069c2e890fa73aee0c5e44861e69/WDL/runtime/backend/singularity.py#L94-L105

This wasn't designed with the cluster/distributed case in mind, so it may be worth looking at overriding in the derived class (or generalizing in the base class) so that /tmp isn't network-mounted at all (if in fact that's the case now). Anyway, while I suspect this is relevant, it doesn't necessarily explain the exact reported behavior with stochastic unwrite-ability.
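
To illustrate the general idea at the singularity level (a sketch only; /local/scratch is a hypothetical node-local path, image.sif and my_command are placeholders, and how to wire this into the backend class is exactly the open question here):

# sketch: bind a node-local scratch directory over /tmp,
# instead of a directory under the (potentially NFS-mounted) task working dir
LOCAL_TMP="/local/scratch/${SLURM_JOB_ID}/tmp"   # hypothetical node-local location
mkdir -p "$LOCAL_TMP"
singularity exec --bind "$LOCAL_TMP:/tmp" image.sif my_command ...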

Lastly, on the general issue of NFS consistency, miniwdl-aws has an option to inject some fsync commands at the end of each task, which may be worth taking a look at:

https://github.com/miniwdl-ext/miniwdl-aws/blob/e0c0419213449ad4d8c6dbb7dd44383420412e59/miniwdl_aws/batch_job.py#L255-L257

But the challenge is that the semantics of fsync wrt visibility on other nodes is totally dependent on the NFS client and server software and configuration; it's difficult to make a general statement about when this may help. And again, for the specific case of /tmp we should probably look for a way to keep that off of NFS in the first place.
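
For illustration only, an injected epilogue along those lines could look roughly like the shell below; this is a sketch of the idea rather than the actual miniwdl-aws code, and it assumes a GNU coreutils sync new enough to accept file operands (i.e. to fsync individual files):

# sketch: fsync each file in the task working directory before the task exits,
# in the hope that the data becomes visible to other NFS clients sooner
find . -type f -print0 | xargs -0 -r sync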

> miniwdl-aws has an option to inject some fsync commands at the end of each task, which may be worth taking a look at:

Fsync is a no-go. You are rarely the only one using a compute node. If someone else's task is writing data then fsync will block for a very long time. I have seen reports where people had fsync basically never finish.

> This wasn't designed with the cluster/distributed case in mind, so it may be worth looking at overriding in the derived class (or generalizing in the base class) so that /tmp isn't network-mounted at all (if in fact that's the case now). Anyway, while I suspect this is relevant, it doesn't necessarily explain the exact reported behavior with stochastic unwrite-ability.

A good practice is actually to keep /tmp NFS-mounted, but to create the directory beforehand and specify it so singularity will always use this existing /tmp directory globally across all tasks. @williamrowell, did you use this approach? I believe you can force singularity to use a particular directory with the SINGULARITY_TMPDIR environment variable, IIRC.

When /tmp is not NFS-mounted you will run into issues where individual compute nodes have less available disk space than you require on /tmp.

> @williamrowell Can you provide the workflow or task? Does this only occur with tasks using parallel?

Sorry I've been slow to respond, as I've been traveling. The workflow I'm using is private, but I hope to be able to share it soon. I can't say that the latency issue only occurs for tasks using parallel, but the step where I'm seeing a filesystem error most often is:

task deepvariant_make_examples {

    ...

    command <<<
        set -euo pipefail

        mkdir example_tfrecords nonvariant_site_tfrecords

        seq ~{task_start_index} ~{task_end_index} \
        | parallel \
            --jobs ~{tasks_per_shard} \
            --halt 2 \
            /opt/deepvariant/bin/make_examples \
                ...
                --task {}

        tar -zcvf ~{sample_id}.~{task_start_index}.example_tfrecords.tar.gz example_tfrecords
        tar -zcvf ~{sample_id}.~{task_start_index}.nonvariant_site_tfrecords.tar.gz nonvariant_site_tfrecords
    >>>
    ...

    runtime {
        docker: "gcr.io/deepvariant-docker/deepvariant:~{deepvariant_version}"
        ...
    }
}

It seems odd to me that parallel fails here when the mkdir immediately preceding it hasn't failed, and both are writing to the same filesystem.

> A good practice is actually to keep /tmp NFS-mounted, but to create the directory beforehand and specify it so singularity will always use this existing /tmp directory globally across all tasks. @williamrowell, did you use this approach? I believe you can force singularity to use a particular directory with the SINGULARITY_TMPDIR environment variable, IIRC.

> When /tmp is not NFS-mounted you will run into issues where individual compute nodes have less available disk space than you require on /tmp.

In general, on our HPC we set TMPDIR=/scratch, a local disk with a lot of storage. I could be wrong, but I think that $SINGULARITY_TMPDIR confusingly refers to a temporary directory for building images. To bind a directory to /tmp, I think you need --bind $TMPDIR:/tmp or export SINGULARITY_BINDPATH="$TMPDIR:/tmp". The danger here is that not all sysadmins allow the user to define bindpaths.
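
Concretely, I'm picturing something like this (assuming $TMPDIR points at node-local scratch and the site allows user-defined bind paths; image.sif and my_command are placeholders):

# bind node-local scratch over /tmp for a single invocation
singularity exec --bind "$TMPDIR:/tmp" image.sif my_command ...

# or set it once in the environment so every singularity invocation picks it up
export SINGULARITY_BINDPATH="$TMPDIR:/tmp"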

I agree with @mlin that there are two issues here: handling latency for NFS in general, and handling /tmp. I'm not really sure what the best solution is though. Sorry I'm bringing more problems than ideas.

Is this a latency issue, though? It throws a "not writable" error rather than a "does not exist" error.
It is a bit odd that /tmp is not writable. Are the singularity processes running as the same user as the miniwdl process?

Singularity processes are owned by the miniwdl process, run by the same user.

I agree, it's not the error message I expect. Maybe this isn't latency at all, but some other weird stochastic error. If the task is retried, it will (most of the time) complete successfully.

I would expect an error from singularity stating that the bound directory does not exist, like in the example below (lightly edited for readability).

wrowell@vm-login8-2:sing_test$ singularity pull docker://ubuntu:latest
INFO:    Using cached SIF image

# try creating a file within the PWD
wrowell@vm-login8-2:sing_test$ singularity exec ubuntu_latest.sif touch in_working_directory.txt

# try creating a file within a bound directory that does exist on host
wrowell@vm-login8-2:sing_test$ mkdir tmp
wrowell@vm-login8-2:sing_test$ singularity exec -B tmp:/tmp ubuntu_latest.sif touch /tmp/in_local_tmp_directory.txt

# try creating a file within a "bound" directory that doesn't exist on host
wrowell@vm-login8-2:sing_test$ singularity exec -B not_available:/tmp ubuntu_latest.sif touch /tmp/directory_not_available_on_host.txt
FATAL:   container creation failed: mount hook function failure: mount sing_test/not_available->/tmp error: while mounting sing_test/not_available: mount source sing_test/not_available doesn't exist

wrowell@vm-login8-2:sing_test$ tree
.
├── tmp
│   └── in_local_tmp_directory.txt
└── ubuntu_latest.sif

1 directory, 2 files

Was this issue solved on your end? Can you post what the solution was? obligatory xkcd

I closed it because it didn't seem to be NFS latency, given the different error message for a mount source that "doesn't exist".

This has been occurring a lot, in one case in up to ~15% of runs for one type of task, but in a completely non-reproducible way. Yesterday I restarted a workflow where this had previously happened in 22 tasks, and this time all of the tasks completed successfully. I'm now prepared to add debugging to the singularity logs if it pops up again. If I learn anything else, I'll update here.