BiBiServ/bibigrid

Error using Nextflow with slurm, tracing and docker

Closed this issue · 20 comments

Problem

When running a Nextflow workflow on the Slurm cluster with tracing enabled (for example for the report) and with Docker containers, the workflow stops with the following error:

N E X T F L O W  ~  version 22.10.4
Launching `https://github.com/nextflow-io/hello` [shrivelled_magritte] DSL2 - revision: 4eab81bd42 [master]
executor >  slurm (4)
[81/7a68e9] process > sayHello (4) [100%] 4 of 4, failed: 2 ✘
Bonjour world!

Ciao world!

Error executing process > 'sayHello (3)'

Caused by:
  Process `sayHello (3)` terminated with an error exit status (1)

Command executed:

  echo 'Hello world!'

Command exit status:
  1

Command output:
  (empty)

Command error:
  touch: .command.trace: Permission denied

Work dir:
  /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

2 out of 4 processes failed. When monitoring the execution of the Slurm jobs, I noticed that the 2 failed processes were executed on a worker node and the successful ones on the master node.
If you resume the workflow (./nextflow run hello -resume), everything works fine, as the two failed processes will be scheduled on the master.

If you disable the report or Docker, or even use the local executor, this problem does not appear. It only shows up in the specific combination of Slurm executor, containerization and enabled tracing.

If you log into a worker node and try to execute the workflow there with the local executor, every process fails.

Maybe this problem comes from missing access rights to the NFS folder for the Docker runtime on the worker nodes? 🤔
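A quick way to check that hypothesis (the work directory is taken from the error above; the worker hostname is an assumption):

# on the master: who owns the failing work directory and the trace file?
ls -ld /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650
ls -l /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650/.command.trace

# repeat the same check from a worker node and compare owner, group and mode;
# with root squashing active, root on an NFS client is mapped to the anonymous
# user, so a root process (e.g. inside a Docker container) may not be allowed
# to write files such as .command.trace
ssh worker-1 'ls -ld /vol/spool/work/50/3340daee9c443cd5d57c2c01db1650'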

Steps to reproduce

  1. Create a cluster and change into the NFS folder
  2. Install Java: sudo apt install openjdk-11-jdk-headless openjdk-11-jre-headless
  3. Install Nextflow: curl -s https://get.nextflow.io | bash
  4. Create a nextflow.config file with the following content:
report.enabled = true
process.executor = "slurm"
docker.enabled = true
  5. Run a simple workflow: ./nextflow run hello
  6. Enable/disable docker or the report, or use "local" as executor (the full sequence is condensed into a sketch after this list)
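The same steps, condensed into a copy-and-paste sketch; it assumes you are already inside the shared NFS folder (e.g. /vol/spool):

cd /vol/spool
sudo apt install -y openjdk-11-jdk-headless openjdk-11-jre-headless
curl -s https://get.nextflow.io | bash

# minimal config reproducing the problematic combination
cat > nextflow.config <<'EOF'
report.enabled = true
process.executor = "slurm"
docker.enabled = true
EOF

# run the hello workflow; toggle report/docker/executor in the config
# above to make the error appear and disappear
./nextflow run hello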

Nextflow indicates that this is a permission error.

a) It might be this issue: nextflow-io/nextflow#1295 (caused by Docker; in that case we can only look at workarounds)
b) Or the NFS export is faulty for this use case.
EDIT1: Setting the NFS share to no_root_squash allows running sudo ./nextflow run hello without issues. I will look further into file system permissions.
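For reference, the change boils down to the export options on the NFS server; the export path and client range below are placeholders, the real values depend on the cluster setup:

# /etc/exports on the master (placeholder path and subnet)
/vol/spool 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)

# apply the changed exports
sudo exportfs -ra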

I need to start the Nextflow process via the REST API of Slurm. This endpoint doesn't allow specifying the user, and the Nextflow command will always be executed as the slurm user.
So running the Nextflow command with sudo is unfortunately not possible for my use case.
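For context, my submission through the Slurm REST API looks roughly like the sketch below. The API version and the exact payload fields differ between Slurm releases, so treat the names here as approximations rather than the real schema; the relevant point is that the job runs under the user tied to the token (the slurm user), with no sudo step in between.

# rough shape of a job submission via slurmrestd (version and fields are approximate)
curl -s -X POST "http://localhost:6820/slurm/v0.0.37/job/submit" \
  -H "X-SLURM-USER-NAME: slurm" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
  -H "Content-Type: application/json" \
  -d '{
        "job": {
          "name": "nextflow-hello",
          "current_working_directory": "/vol/spool",
          "environment": ["PATH=/usr/local/bin:/usr/bin:/bin"]
        },
        "script": "#!/bin/bash\n./nextflow run hello"
      }'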

In any case, "use sudo" wouldn't be a real solution :) but I have the feeling that this solves starting Nextflow from the workers in general. I am currently restarting the cluster to validate that no_root_squash was what did the trick.

Can you retry using the branch https://github.com/BiBiServ/bibigrid/tree/nextflow-nfs-permissions? I need to read more about Nextflow in order to be certain that no_root_squash is the solution, as I am currently unsure how Nextflow interacts with Slurm, but maybe this already works for you.

In any case, we can now run ./nextflow run hello on the workers using the "local" executor.
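To verify directly on a worker, something like this should now succeed (the worker hostname is just an example):

ssh worker-1           # worker hostname is an example
cd /vol/spool          # shared NFS folder containing the nextflow launcher
./nextflow run hello   # with process.executor = "local" in nextflow.config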

Yeah, everything works 🥳

Well at least as long as the user in the docker container is root 🤔

The root user issue might be the Nextflow issue I posted under a) initially. Can you give me the steps to reproduce your error again? I simply haven't dived that deep into Nextflow yet. Maybe I will find a solution for that, too.

You can use the same setup as before. Simply add the line process.container = 'biocontainers/samtools:v1.7.0_cv4' to the file nextflow.config.

Then run the workflow ./nextflow run hello again.

The container biocontainers/samtools:v1.7.0_cv4 is definitely rootless.

Can you let me know what exact error you are getting?

It is the same error as in the original post.

Thank you. Just to make sure I actually reproduced your issue: This also doesn't work on the master when using the slurm user and:

report.enabled = true
process.executor = "slurm"
docker.enabled = true
process.container = 'biocontainers/samtools:v1.7.0_cv4'

correct?

I tested it from outside the slurm user - where it works fine - but I do get the error on master and worker when I become the slurm user first. I will investigate how to fix this.

What follows is not the solution, but merely another question out of interest:
Did you change the permissions of nextflow in order to allow slurm to run it or did it just work for you?

I install Nextflow globally. This means I put the executable in the directory /usr/local/bin and change the permissions to 755.

Moreover, I make these other changes too (condensed into a sketch below):

  • Add the slurm user to the docker group
  • Export the environment variable NXF_HOME=/home/slurm/.nextflow

Another important point: when you run Nextflow, make sure there is no work folder in the current working directory that the current user doesn't have access to.
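A condensed sketch of that setup, assuming the nextflow launcher has already been downloaded into the current directory:

# install nextflow globally and make it executable for everyone
sudo mv nextflow /usr/local/bin/nextflow
sudo chmod 755 /usr/local/bin/nextflow

# let the slurm user talk to the Docker daemon
sudo usermod -aG docker slurm

# point Nextflow's home to a directory the slurm user owns
export NXF_HOME=/home/slurm/.nextflow

# check for a stale work folder with foreign ownership before launching
ls -ld work 2>/dev/null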

Alright, then we pretty much do have the same setup.

I might consider adding the slurm user to the docker group so that users do not have to do this on their own, but I will talk to Jan about it first, in case this is a security issue.

Hey,
can you try adding docker.runOptions = '-u $(id -u):$(id -g)' to your nextflow.config and see if this fixes your problem?
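For clarity, the full nextflow.config from this thread with that line added would then look like this (the container line is only needed for the rootless-image test from above):

report.enabled = true
process.executor = "slurm"
docker.enabled = true
docker.runOptions = '-u $(id -u):$(id -g)'   // start the container with the submitting user's UID/GID
process.container = 'biocontainers/samtools:v1.7.0_cv4'

With the -u option, files written into the NFS work directory belong to the submitting user instead of root, which should avoid the kind of permission clash seen with .command.trace above.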

Also: How does your worker access the executable if it is not placed in a shared folder?

Nextflow will be installed on every machine with an Ansible script.

I see. That shouldn't be an issue. Has my solution above fixed your problem?

I had to set up a new BiBiGrid.

It works 👍
But I suspect that some containers will not work anymore when the process inside the container needs to be root, e.g. when some files in the container are only accessible to root.

Do you mean that some containers that are rootless have files for which they need root access, or do you mean that running as root is no longer possible in general because of my solution above? Providing an example would help me tackle that issue, as I am currently not able to think it through without one.

I mean containers that are NOT rootless.

For example, some bioinformatics tool gets compiled inside a container and the user inside of the container is root. All the files that are copied or created inside the container belong to the root user. So maybe, if we change the user inside the container during execution, the new user doesn't have access to these files due to missing permissions, e.g. an executable file has the permissions rwxr--r--.

But this is only my suspicion and I don't have a working example for this.
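A hypothetical sketch of the scenario I mean (the image and tool names are invented):

# image where /usr/local/bin/mytool was created by root during the build
# with mode rwxr--r-- (only root may execute it)

docker run --rm example/mytool mytool --version
# works: the container user is root and may execute the file

docker run --rm -u 1000:1000 example/mytool mytool --version
# fails with "permission denied": the mapped-in user may read the file but not execute it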

Alright, that's difficult for me to judge for now, but I see this as a valid concern. I will speak with Jan about it and keep it in mind in case another user has this issue. Thank you.

If everything works for now, I am going to close this issue.