Error using Nextflow with slurm, tracing and docker
Closed this issue · 20 comments
Problem
When running a Nextflow workflow on the Slurm cluster with tracing enabled (for example, for the report) and using containers with Docker, the workflow stops with the following error:
N E X T F L O W ~ version 22.10.4
Launching `https://github.com/nextflow-io/hello` [shrivelled_magritte] DSL2 - revision: 4eab81bd42 [master]
executor > slurm (4)
[81/7a68e9] process > sayHello (4) [100%] 4 of 4, failed: 2 ✘
Bonjour world!
Ciao world!
Error executing process > 'sayHello (3)'
Caused by:
Process `sayHello (3)` terminated with an error exit status (1)
Command executed:
echo 'Hello world!'
Command exit status:
1
Command output:
(empty)
Command error:
touch: .command.trace: Permission denied
Work dir:
/vol/spool/work/50/3340daee9c443cd5d57c2c01db1650
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
2 out of 4 processes failed. When monitoring the execution of the Slurm jobs, I noticed that the 2 failed processes were executed on a worker node and the successful ones on the master node.
If you resume the workflow (`./nextflow run hello -resume`), everything works fine, as the two failed processes will be scheduled on the master.
If you disable the report or Docker, or even use the local executor, this problem does not appear. It occurs only in the specific combination of Slurm executor, containerization, and enabled tracing.
If you log into the worker node and try to execute the workflow there with the local executor, then every process fails.
Maybe this problem comes from missing access rights to the NFS folder for the Docker runtime on the worker nodes? 🤔
Steps to reproduce
- Create a cluster and change into the NFS folder
- Install Java: `sudo apt install openjdk-11-jdk-headless openjdk-11-jre-headless`
- Install Nextflow: `curl -s https://get.nextflow.io | bash`
- Create a Nextflow config file with the following content:
  ```
  report.enabled = true
  process.executor = "slurm"
  docker.enabled = true
  ```
- Run a simple workflow: `./nextflow run hello`
- Enable/disable Docker or the report, or use `"local"` as the executor
Nextflow indicates that this is a permission error.
a) It might be this issue: nextflow-io/nextflow#1295 (caused by Docker; in that case we can only look at workarounds)
b) Or the NFS export is faulty for this use case.
EDIT1: Setting the NFS share to `no_root_squash` allows running `sudo ./nextflow run hello` without issues. I will look further into file system permissions.
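For reference, `no_root_squash` is set per export in `/etc/exports` on the NFS server. A minimal sketch, assuming the shared folder is `/vol/spool` (adapt the path and the other options to your setup):

```
# /etc/exports on the NFS server
# no_root_squash: do NOT map root on the client to the anonymous user,
# so processes running as root (e.g. inside Docker) can write to the share
/vol/spool *(rw,sync,no_subtree_check,no_root_squash)
```

After editing, re-export the shares with `sudo exportfs -ra`.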
I need to start the Nextflow process via the REST API of Slurm. This endpoint doesn't allow specifying the user, and the Nextflow command will always be executed as the `slurm` user.
So running the Nextflow command with `sudo` is unfortunately not possible for my use case.
In any case, "use sudo" wouldn't be a real solution :) but I have the feeling that this solved starting Nextflow from workers in general. I am currently restarting the cluster to validate that `no_root_squash` was what did the trick.
Can you retry using the branch https://github.com/BiBiServ/bibigrid/tree/nextflow-nfs-permissions ? I need to read more about Nextflow in order to be certain that `no_root_squash` is the solution, as I am currently unsure how Nextflow interacts with Slurm, but maybe this already works for you.
In any case, we can now run `./nextflow run hello` on workers using the "local" executor.
Yeah, everything works 🥳

> Yeah, everything works

Well, at least as long as the user in the Docker container is root 🤔
The root user issue might be the Nextflow issue I posted under a) initially. Can you give me the steps to reproduce your error again? (Simply because I haven't dived that deep into Nextflow yet.) Maybe I will find a solution for that, too.
You can use the same setup as before. Simply add the line `process.container = 'biocontainers/samtools:v1.7.0_cv4'` to the file `nextflow.config`.
Then run the workflow `./nextflow run hello` again.
The container `biocontainers/samtools:v1.7.0_cv4` is definitely rootless.
Can you let me know what exact error you are getting?
It is the same error as in the original post
Thank you. Just to make sure I actually reproduced your issue: this also doesn't work on the master when using the slurm user and:

```
report.enabled = true
process.executor = "slurm"
docker.enabled = true
process.container = 'biocontainers/samtools:v1.7.0_cv4'
```

correct?
I tested it from outside the slurm user, where it works fine, but I do get the error on master and worker when I become the slurm user first. I will investigate how to fix this.
What follows is not the solution, but merely another question out of interest:
Did you change the permissions of nextflow in order to allow slurm to run it or did it just work for you?
I install Nextflow globally. This means I put the executable in the directory `/usr/local/bin` and change the permissions to `755`.
Moreover, I make these other changes too:
- Add the `slurm` user to the `docker` group
- Export the environment variable `NXF_HOME=/home/slurm/.nextflow`

Another important point: you have to make sure that there is no folder `work` in the current working directory to which the current user doesn't have access when you run Nextflow.
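That last pitfall is easy to simulate outside of Nextflow (a throwaway sketch; `demo` and the `work` folder here are made up for the demonstration):

```shell
# Simulate a leftover `work` folder the current user cannot access
mkdir -p demo && cd demo
mkdir -p work
chmod 000 work            # remove all permissions (root excepted)
stat -c '%a %n' work      # -> "0 work": no one else can enter or list it
# Nextflow started in this directory would fail with a permission error.
# Clean up:
chmod 700 work && rmdir work
```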
Alright, then we pretty much have the same setup.
I might consider adding `slurm` to `docker` so that users do not have to do this on their own, but I will talk to Jan about it first, in case this is a security issue.
Hey,
can you try adding `docker.runOptions = '-u $(id -u):$(id -g)'` to your `nextflow.config` and see if this fixes your problem?
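Putting the pieces from this thread together, the resulting `nextflow.config` would look like this (a sketch of the combination discussed above, not an official template):

```groovy
// nextflow.config
report.enabled   = true     // write the HTML execution report (relies on tracing)
process.executor = "slurm"  // submit tasks as Slurm jobs
docker.enabled   = true     // run each task in its Docker container
// Run the container as the calling user instead of root, so files the task
// creates in the NFS work dir (e.g. .command.trace) stay writable by that user:
docker.runOptions = '-u $(id -u):$(id -g)'
```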
Also: how does your worker access the executable if it is not placed in a shared folder?
`nextflow` will be installed on every machine with an Ansible script.
I see. That shouldn't be an issue. Has my solution above fixed your problem?
I had to set up a new BiBiGrid.
It works 🎉
But I suspect that some containers will no longer work when the process inside the container needs to be root, e.g. when some files in the container are only accessible to root.
Do you mean that some containers that are rootless have files for which they need root access, or do you mean that root is no longer possible in general due to my above solution? Providing an example would help me tackle that issue, as I am currently not able to think it through without one.
I mean containers that are NOT rootless.
For example, some bioinformatics tool gets compiled inside a container and the user inside the container is root. All the files that are copied or created inside the container belong to the root user. So maybe, if we change the user inside the container during execution, the new user doesn't have access to these files due to missing permissions, e.g. an executable file with the permissions `rwxr--r--`.
But this is only my suspicion and I don't have a working example for this.
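The suspected failure mode can be demonstrated outside Docker (a sketch; `tool.sh` is a made-up file name standing in for a tool compiled inside the image):

```shell
# A file with mode 744 (rwxr--r--): only the owner has the execute bit
printf '#!/bin/sh\necho ok\n' > tool.sh
chmod 744 tool.sh
stat -c '%A' tool.sh        # -> -rwxr--r--
# The owner (root, inside the original image) can run it; any other user
# that docker is switched to via `-u` can read it but NOT execute it.
```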
Alright, that's difficult for me to judge for now, but I see this as a valid concern. I will speak with Jan about it and keep it in mind in case another user has this issue. Thank you.
If everything works for now, I am going to close this issue.