BiBiServ/bibigrid

Error using Nextflow with slurm, tracing and docker

Closed this issue · 20 comments

Problem

When running a Nextflow workflow on the Slurm cluster with tracing enabled (for example for the report) and with Docker containers, the workflow stops with the following error:

N E X T F L O W  ~  version 22.10.4
Launching `https://github.com/nextflow-io/hello` [shrivelled_magritte] DSL2 - revision: 4eab81bd42 [master]
executor >  slurm (4)
[81/7a68e9] process > sayHello (4) [100%] 4 of 4, failed: 2 ✘
Bonjour world!

Ciao world!

Error executing process > 'sayHello (3)'

Caused by:
  Process `sayHello (3)` terminated with an error exit status (1)

Command executed:

  echo 'Hello world!'

Command exit status:
  1

Command output:
  (empty)

Command error:
  touch: .command.trace: Permission denied

Work dir:
  /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

2 out of 4 processes failed. When monitoring the execution of the Slurm jobs, I noticed that the 2 failed processes were executed on a worker node and the successful ones on the master node.
If you resume the workflow (./nextflow run hello -resume), everything works fine, as the two failed processes will be scheduled on the master.

If you disable the report or Docker, or even use the local executor, this problem does not appear. It only shows up in the specific combination of Slurm executor, containerization and enabled tracing.

If you log into a worker node and try to execute the workflow there with the local executor, every process fails.

Maybe this problem comes from missing access rights to the NFS folder for the Docker runtime on the worker nodes? 🤔
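A quick way to check that hypothesis (the work directory is taken from the error above; the worker hostname is an assumption):

# on the master: who owns the failing work directory and the trace file?
ls -ld /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650
ls -l /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650/.command.trace

# repeat the same check from a worker node and compare owner, group and mode;
# with root squashing active, root on an NFS client is mapped to the anonymous
# user, so a root process (e.g. inside a Docker container) may not be allowed
# to write files such as .command.trace
ssh worker-1 'ls -ld /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650'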

Steps to reproduce

  1. Create a cluster and change into the NFS folder
  2. Install Java: sudo apt install openjdk-11-jdk-headless openjdk-11-jre-headless
  3. Install Nextflow: curl -s https://get.nextflow.io | bash
  4. Create a nextflow.config file with the following content:
report.enabled = true
process.executor = "slurm"
docker.enabled = true
  5. Run a simple workflow: ./nextflow run hello
  6. Enable/disable docker or the report, or use "local" as executor (the full sequence is condensed into a sketch after this list)
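The same steps, condensed into a copy-and-paste sketch; it assumes you are already inside the shared NFS folder (e.g. /vol/spool):

cd /vol/spool
sudo apt install -y openjdk-11-jdk-headless openjdk-11-jre-headless
curl -s https://get.nextflow.io | bash

# minimal config reproducing the problematic combination
cat > nextflow.config <<'EOF'
report.enabled = true
process.executor = "slurm"
docker.enabled = true
EOF

# run the hello workflow; toggle report/docker/executor in the config
# above to make the error appear and disappear
./nextflow run hello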

Nextflow indicates that this is a permission error.

a) It might be this issue: nextflow-io/nextflow#1295 (caused by Docker; in that case we can only look at workarounds)
b) Or the NFS export is faulty for this use case.
EDIT1: Setting the NFS share to no_root_squash allows running sudo ./nextflow run hello without issues. I will look further into file system permissions.
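For reference, the change boils down to the export options on the NFS server; the export path and client range below are placeholders, the real values depend on the cluster setup:

# /etc/exports on the master (placeholder path and subnet)
/vol/spool 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)

# apply the changed exports
sudo exportfs -ra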

I need to start the Nextflow process via the REST API of Slurm. This endpoint doesn't allow specifying the user, and the Nextflow command will always be executed as the slurm user.
So running the Nextflow command with sudo is unfortunately not possible for my use case.
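For context, my submission through the Slurm REST API looks roughly like the sketch below. The API version and the exact payload fields differ between Slurm releases, so treat the names here as approximations rather than the real schema; the relevant point is that the job runs under the user tied to the token (the slurm user), with no sudo step in between.

# rough shape of a job submission via slurmrestd (version and fields are approximate)
curl -s -X POST "http://localhost:6820/slurm/v0.0.37/job/submit" \
  -H "X-SLURM-USER-NAME: slurm" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
  -H "Content-Type: application/json" \
  -d '{
        "job": {
          "name": "nextflow-hello",
          "current_working_directory": "/vol/spool",
          "environment": ["PATH=/usr/local/bin:/usr/bin:/bin"]
        },
        "script": "#!/bin/bash\n./nextflow run hello"
      }'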

In any case, "use sudo" wouldn't be a real solution :) but I have the feeling that this solves starting Nextflow from the workers in general. I am currently restarting the cluster to validate that no_root_squash was what did the trick.

Can you retry using the branch https://github.com/BiBiServ/bibigrid/tree/nextflow-nfs-permissions? I need to read more about Nextflow in order to be certain that no_root_squash is the solution, as I am currently unsure how Nextflow interacts with Slurm, but maybe this already works for you.

In any case, we can now run ./nextflow run hello on the workers using the "local" executor.
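To verify directly on a worker, something like this should now succeed (the worker hostname is just an example):

ssh worker-1           # worker hostname is an example
cd /vol/spool          # shared NFS folder containing the nextflow launcher
./nextflow run hello   # with process.executor = "local" in nextflow.config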

Yeah, everything works 🥳

Well at least as long as the user in the docker container is root 🤔

The root user issue might be the Nextflow issue I posted under a) initially. Can you give me the steps to reproduce your error again? I simply haven't dived that deep into Nextflow yet. Maybe I will find a solution for that, too.

You can use the same setup as before. Simply add the line process.container = 'biocontainers/samtools:v1.7.0_cv4' to the file nextflow.config.

Then run the workflow ./nextflow run hello again.

The container biocontainers/samtools:v1.7.0_cv4 is definitely rootless.

Can you let me know what exact error you are getting?

It is the same error as in the original post.

Thank you. Just to make sure I actually reproduced your issue: This also doesn't work on the master when using the slurm user and:

report.enabled = true
process.executor = "slurm"
docker.enabled = true
process.container = 'biocontainers/samtools:v1.7.0_cv4'

correct?

I tested it from outside the slurm user - where it works fine - but I do get the error on master and worker when I become the slurm user first. I will investigate how to fix this.

What follows is not the solution, but merely another question out of interest:
Did you change the permissions of nextflow in order to allow slurm to run it or did it just work for you?

I install Nextflow globally. This means I put the executable in the directory /usr/local/bin and change the permissions to 755.

Moreover, I make these other changes too (condensed into a sketch below):

  • Add the slurm user to the docker group
  • Export the environment variable NXF_HOME=/home/slurm/.nextflow

Another important point: when you run Nextflow, make sure there is no work folder in the current working directory that the current user doesn't have access to.
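A condensed sketch of that setup, assuming the nextflow launcher has already been downloaded into the current directory:

# install nextflow globally and make it executable for everyone
sudo mv nextflow /usr/local/bin/nextflow
sudo chmod 755 /usr/local/bin/nextflow

# let the slurm user talk to the Docker daemon
sudo usermod -aG docker slurm

# point Nextflow's home to a directory the slurm user owns
export NXF_HOME=/home/slurm/.nextflow

# check for a stale work folder with foreign ownership before launching
ls -ld work 2>/dev/null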

Alright, then we pretty much do have the same setup.

I might consider adding the slurm user to the docker group so that users do not have to do this on their own, but I will talk to Jan about it first, in case this is a security issue.

Hey,
can you try adding docker.runOptions = '-u $(id -u):$(id -g)' to your nextflow.config and see if this fixes your problem?
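For clarity, the full nextflow.config from this thread with that line added would then look like this (the container line is only needed for the rootless-image test from above):

report.enabled = true
process.executor = "slurm"
docker.enabled = true
docker.runOptions = '-u $(id -u):$(id -g)'   // start the container with the submitting user's UID/GID
process.container = 'biocontainers/samtools:v1.7.0_cv4'

With the -u option, files written into the NFS work directory belong to the submitting user instead of root, which should avoid the kind of permission clash seen with .command.trace above.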

Also: How does your worker access the executable if it is not placed in a shared folder?

Nextflow will be installed on every machine with an Ansible script.

I see. That shouldn't be an issue. Has my solution above fixed your problem?

I had to set up a new BiBiGrid.

It works 👍
But I suspect that some containers will not work anymore when the process inside the container needs to be root, e.g. when some files in the container are only accessible to root.

Do you mean that some containers that are rootless have files for which they need root access, or do you mean that running as root is no longer possible in general because of my solution above? Providing an example would help me tackle that issue, as I am currently not able to think it through without one.

I mean containers that are NOT rootless.

For example, some bioinformatics tool gets compiled inside a container and the user inside of the container is root. All the files that are copied or created inside the container belong to the root user. So maybe, if we change the user inside the container during execution, the new user doesn't have access to these files due to missing permissions, e.g. an executable file has the permissions rwxr--r--.

But this is only my suspicion and I don't have a working example for this.
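A hypothetical sketch of the scenario I mean (the image and tool names are invented):

# image where /usr/local/bin/mytool was created by root during the build
# with mode rwxr--r-- (only root may execute it)

docker run --rm example/mytool mytool --version
# works: the container user is root and may execute the file

docker run --rm -u 1000:1000 example/mytool mytool --version
# fails with "permission denied": the mapped-in user may read the file but not execute it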

Alright, that's difficult for me to judge for now, but I see this as a valid concern. I will speak with Jan about it and keep it in mind in case another user has this issue. Thank you.

If everything works for now, I am going to close this issue.