epi2me-labs/wf-single-cell

Problem with symlinks in fastq input directory

Closed this issue · 2 comments

Operating System

CentOS 7

Other Linux

No response

Workflow Version

wf-single-cell v2.0.2-ge9dac45

Workflow Execution

Command line (Cluster)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

nextflow run epi2me-labs/wf-single-cell \
    -profile singularity \
    --expected_cells 50000 \
    --fastq '/scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/test_input_links/' \
    --kit_name '5prime' \
    --kit_version 'v1' \
    --ref_genome_dir '/data/reference/dawson_labs/genomes/cellranger_reference_GRCh38-2020-A/refdata-gex-GRCh38-2020-A' \
    -w '/scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/work/' \
    --out_dir '/scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/' \
    --threads 16 \
    -resume

Workflow Execution - CLI Execution Profile

singularity

What happened?

Hi, I created a number of symlinks to a subset of fastq files in a directory to test the pipeline.
When I provide that directory as input, the pipeline cannot find the files.
My only explanation is that the files aren't properly mounted in the Singularity container.
I regularly use a different Nextflow pipeline with a directory of symlinks as input, so Nextflow itself should be able to handle this.

I checked in the work directory and the links and files are valid:

(base) papr-res-compute210 ➜  cd work/2d/3972ab2173512788d0e6df10609e1f
(base) papr-res-compute210 ➜  3972ab2173512788d0e6df10609e1f ls -lah
total 21K
drwxrwxr-x 5 hholze hholze 4.0K May 19 13:08 .
drwxrwxr-x 5 hholze hholze 4.0K May 19 13:08 ..
-rw-rw-r-- 1 hholze hholze    0 May 19 13:08 .command.begin
-rw-rw-r-- 1 hholze hholze  413 May 19 13:08 .command.err
-rw-rw-r-- 1 hholze hholze  413 May 19 13:08 .command.log
-rw-rw-r-- 1 hholze hholze    0 May 19 13:08 .command.out
-rw-rw-r-- 1 hholze hholze  11K May 19 13:08 .command.run
-rw-rw-r-- 1 hholze hholze  900 May 19 13:08 .command.sh
-rw-rw-r-- 1 hholze hholze    0 May 19 13:08 .command.trace
-rw-rw-r-- 1 hholze hholze    1 May 19 13:08 .exitcode
drwxrwxr-x 2 hholze hholze 4.0K May 19 13:08 fastcat_stats
drwxrwxr-x 2 hholze hholze 4.0K May 19 13:08 fastq_chunks
drwx------ 2 hholze hholze 4.0K May 19 13:08 histograms
lrwxrwxrwx 1 hholze hholze  107 May 19 13:08 input_src -> /scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/test_input_links
(base) papr-res-compute210 ➜  3972ab2173512788d0e6df10609e1f ls -lah input_src
lrwxrwxrwx 1 hholze hholze 107 May 19 13:08 input_src -> /scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/test_input_links
(base) papr-res-compute210 ➜  3972ab2173512788d0e6df10609e1f ls -lah /scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/test_input_links
total 2.0K
drwxrwxr-x 2 hholze hholze 4.0K May 19 13:08 .
drwxrwxr-x 7 hholze hholze 4.0K May 19 13:08 ..
lrwxrwxrwx 1 hholze hholze  168 May 19 13:07 FAY30355_pass_68a76471_936db429_25.fastq.gz -> /pipeline/Runs/Nanopore/20240514_1009_MN22007_FAY30355_68a76471/no_sample/20240514_1009_MN22007_FAY30355_68a76471/fastq_pass/FAY30355_pass_68a76471_936db429_25.fastq.gz
lrwxrwxrwx 1 hholze hholze  168 May 19 13:07 FAY30355_pass_68a76471_936db429_26.fastq.gz -> /pipeline/Runs/Nanopore/20240514_1009_MN22007_FAY30355_68a76471/no_sample/20240514_1009_MN22007_FAY30355_68a76471/fastq_pass/FAY30355_pass_68a76471_936db429_26.fastq.gz

Relevant log output

ERROR ~ Error executing process > 'fastcat (1)'

Caused by:
  Process `fastcat (1)` terminated with an error exit status (1)

Command executed:

  mkdir fastcat_stats
  mkdir fastq_chunks

  # Save file as compressed fastq
  fastcat         -s test_input_links         -f fastcat_stats/per-file-stats.tsv         -i fastcat_stats/per-file-runids.txt         --histograms histograms                           input_src     | if [ "1000000" = "0" ]; then
      bgzip -@ 4 > fastq_chunks/seqs.fastq.gz
    else
      split -l 4000000 -d --additional-suffix=.fastq.gz --filter='bgzip -@ 4 > $FILE' - fastq_chunks/seqs_;
    fi

  mv histograms/* fastcat_stats

  # get n_seqs from per-file stats - need to sum them up
  awk 'NR==1{for (i=1; i<=NF; i++) {ix[$i] = i}} NR>1 {c+=$ix["n_seqs"]} END{print c}'         fastcat_stats/per-file-stats.tsv > fastcat_stats/n_seqs
  # get unique run IDs
  awk 'NR==1{for (i=1; i<=NF; i++) {ix[$i] = i}} NR>1 {print $ix["run_id"]}'         fastcat_stats/per-file-runids.txt | sort | uniq > fastcat_stats/run_ids

Command exit status:
  1

Command output:
  (empty)

Command error:
  Processing input_src/FAY30355_pass_68a76471_936db429_26.fastq.gz
  Error: could not process file input_src/FAY30355_pass_68a76471_936db429_26.fastq.gz: No such file or directory
  Processing input_src/FAY30355_pass_68a76471_936db429_25.fastq.gz
  Error: could not process file input_src/FAY30355_pass_68a76471_936db429_25.fastq.gz: No such file or directory
  Completed processing with errors. Outputs may be incomplete.

Work dir:
  /scratch/teams/dawson_genomics/Projects/PRC2_BE_screen/results/MF01_nanopore/epi2me_output/work/2d/3972ab2173512788d0e6df10609e1f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Application activity log entry

No response

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

No response

Hi @HenriettaHolze

The input file locations should be automounted, but maybe your version of singularity does not support this.

You can try manually mounting the location by creating a config file as below (call it mount.config).

process {
    containerOptions = '--bind /pipeline/Runs/Nanopore/:/pipeline/Runs/Nanopore/'
}

Then add -c mount.config to your command.
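With that in place, your original invocation just gains the extra flag (a sketch; all other options stay exactly as in your original command):

nextflow run epi2me-labs/wf-single-cell \
    -c mount.config \
    -profile singularity \
    ...    # remaining options unchanged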

Thanks,

Neil

Hi @HenriettaHolze,

To add to Neil's answer: Nextflow does not check for symlinks inside a process' input directories.

I think it's easier to understand the problem (and why you don't encounter it every time you use symlinks in input directories) with a little extra background. Let's have a look at how Nextflow decides what locations to mount when launching a container:

In most cases, Nextflow collects a list of three types of directories to mount:

  • $projectDir/bin (for scripts in bin; this is omitted if there is no $projectDir/bin directory)
  • the work directory of the process
  • directories containing inputs

It then collapses this list to the longest "common" parent paths (excluding /) and mounts those; we'll see a code sketch of this collapsing logic further down. Let's consider a simple example to illustrate:

  • Make 3 files: main.nf, /tmp/gh-example-1/file.txt, and /tmp/gh-example-2/file.txt (see below for file contents).
  • Create a directory called linked-files in the parent directory of main.nf and create symlinks to the other two files in there.

The structure of our example directory looks like this:

$ tree
.
├── linked-files
│   ├── f1.txt -> /tmp/gh-example-1/file.txt
│   └── f2.txt -> /tmp/gh-example-2/file.txt
└── main.nf

and the file contents are:

$ more * linked-files/*

*** linked-files: directory ***

::::::::::::::
main.nf
::::::::::::::
process count_lines_in_multiple_files {
    input: path "inputs/*"
    output: stdout
    script:
    """
    wc -l inputs/*
    """
}

workflow {
    Channel.fromPath("linked-files/*")
    | collect
    | count_lines_in_multiple_files
    | view
}

*** output: directory ***

::::::::::::::
linked-files/f1.txt
::::::::::::::
1
2
3
::::::::::::::
linked-files/f2.txt
::::::::::::::
a
b
c

Now, when we run Nextflow with Docker (using Singularity instead would make no difference in this case):

$ nextflow run main.nf -with-docker ubuntu:latest
Nextflow 24.04.1 is available - Please consider updating your version to it
N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [determined_lagrange] DSL2 - revision: 6a9af8817d
executor >  local (1)
[1e/0d3b5b] process > count_lines_in_multiple_files [100%] 1 of 1 ✔
 3 inputs/f1.txt
 3 inputs/f2.txt
 6 total
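To see which paths were actually mounted, we can look at the docker run command inside the task's .command.run script, e.g. (the work-dir hash will differ between runs):

$ cd work/1e/0d3b5b*          # work dir shown in the process line above
$ grep -m1 'docker run' .command.run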

This turns up the following docker run command:

docker run -i --cpu-shares 1024 -e "NXF_TASK_WORKDIR" -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /tmp:/tmp -v /home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3:/home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3 -w "$PWD" --name $NXF_BOXID ubuntu:latest /bin/bash /home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3/.command.run nxf_trace

with the two mounts -v /tmp:/tmp and -v /home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3:/home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3.

If we now create an empty bin directory in the parent directory of main.nf and run nextflow run ... again, the docker run command in .command.run will change to

docker run -i --cpu-shares 1024 -e "NXF_TASK_WORKDIR" -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /tmp:/tmp -v /home/jle/gh-nxf-example:/home/jle/gh-nxf-example -w "$PWD" --name $NXF_BOXID ubuntu:latest /bin/bash -c "eval $(nxf_container_env); /bin/bash /home/jle/gh-nxf-example/work/8a/e2944e3c2eea3a0444b31b15bdf6ae/.command.run nxf_trace"

Note that the second volume changed to -v /home/jle/gh-nxf-example:/home/jle/gh-nxf-example.

Let's unpack this in light of what we said before. In the first case, the list of targets to mount is

  • the work dir (/home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3)
  • the inputs (/tmp/gh-example-1/file.txt and /tmp/gh-example-2/file.txt)

Since both input paths share /tmp, only /tmp is mounted for them.

In the second case there is also a $projectDir/bin directory that has to be mounted. As it shares /home/jle/gh-nxf-example with the work dir, only /home/jle/gh-nxf-example is mounted.
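The collapsing logic can be sketched roughly like this (hypothetical Groovy, not Nextflow's actual implementation; commonRoot is a made-up helper):

// Longest shared directory of two absolute paths, never collapsing to "/".
def commonRoot(String a, String b) {
    def pa = a.tokenize('/'), pb = b.tokenize('/')
    def shared = []
    for (int i = 0; i < Math.min(pa.size(), pb.size()); i++) {
        if (pa[i] != pb[i]) break
        shared << pa[i]
    }
    shared ? '/' + shared.join('/') : null  // null: nothing in common except "/"
}

assert commonRoot('/tmp/gh-example-1/file.txt',
                  '/tmp/gh-example-2/file.txt') == '/tmp'
assert commonRoot('/home/jle/gh-nxf-example/bin',
                  '/home/jle/gh-nxf-example/work/1e/0d3b5b9b76aa76c18d2bc13eab74c3') == '/home/jle/gh-nxf-example'

Applied pairwise to the mount targets, any pair that reduces to null (nothing in common except /) simply stays as two separate mounts.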

Now, why didn't the above fail with an error similar to what you got? Because we didn't pass a directory containing symlinks to the count_lines_in_multiple_files process, but rather the symlinks themselves, and Nextflow figured out that it had to mount /tmp. If we change our main.nf to the below (note that we now pass the directory itself and modify the process accordingly)

process count_lines_in_multiple_files {
    input: path "inputs"
    output: stdout
    script:
    """
    wc -l inputs/*
    """
}

workflow {
    Channel.fromPath("linked-files")
    | count_lines_in_multiple_files
    | view
}

we get an error:

$ nextflow run main.nf -with-docker ubuntu:latest
Nextflow 24.04.1 is available - Please consider updating your version to it
N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [nauseous_ritchie] DSL2 - revision: 8847bbb6cb
executor >  local (1)
[23/347031] process > count_lines_in_multiple_files (1) [100%] 1 of 1, failed: 1 ✘
ERROR ~ Error executing process > 'count_lines_in_multiple_files (1)'

Caused by:
  Process `count_lines_in_multiple_files (1)` terminated with an error exit status (1)

Command executed:

  wc -l inputs/*

Command exit status:
  1

Command output:
  0 total

Command error:
  0 total
  wc: inputs/f1.txt: No such file or directory
  wc: inputs/f2.txt: No such file or directory

Work dir:
  /home/jle/gh-nxf-example/work/23/3470319bd0ca3dd7378095b41fb0ef

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
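For completeness: the failing toy example can be fixed the same way Neil suggested for the workflow, by binding the symlink targets' location explicitly. A minimal sketch (fix.config is just an illustrative name):

process {
    containerOptions = '-v /tmp:/tmp'  // Docker syntax; with Singularity use '--bind /tmp:/tmp'
}

$ nextflow run main.nf -with-docker ubuntu:latest -c fix.config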

Taking everything together, we can say the following:

  • Nextflow won't mount locations of files that are symlinked inside directories passed to processes.
  • It will, however, "collapse" paths of inputs that share a common top-level directory, and this is most likely the reason why you didn't run into the problem earlier: in those cases the symlinks probably led to a location inside the same top-level directory as either your work dir or your projectDir (which in most cases is $HOME/.nextflow/assets/...).

If you want to use symlinks leading to files in /pipeline, there are broadly two ways to avoid this problem:

  • run the workflow somewhere in /pipeline as well
  • explicitly mount /pipeline in a Nextflow config file, as Neil explained in his answer.

I hope this was helpful; please let us know if you have any other questions.