Fastq file names containing the string "sample" cause the pipeline to fail

Question

bioruffo opened this issue 3 years ago · 3 comments

Hello,
I was running JAFFA with a dummy file named just "sample_1.fastq.gz", and the pipeline was failing at the last step, like this:

==================================== Stage compile_all_results =====================================
R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> options(echo=F)
Compiling the results from:
sample_1.fastq
Done writing output jaffa_results.csv
Done writing jaffa_results.fasta
All Done.
***********************************************************************
 Citation:
   Davidson, N.M., Majewski, I.J. & Oshlack, A.
   JAFFA: High sensitivity transcriptome-focused fusion gene detection.
   Genome Med 7, 43 (2015)
***********************************************************************
ERROR: Expected output file jaffa_results.fasta could not be found


========================================= Pipeline Failed ==========================================

Expected output file jaffa_results.fasta could not be found

Use 'bpipe errors' to see output from failed commands.

Indeed, the file "jaffa_results.fasta" is never generated if the file name contains the string "sample".
If I rename the same file to "sampl3_1.fastq.gz", however, the pipeline succeeds.

This appears to be caused by the function get_fusion_seqs() in scripts/get_fusion_seqs.bash. On line 36, it returns with no output if the first token of the "jaffa_results.csv" line being processed contains the string "sample":

  if [[ ${field1} =~ "sample" ]]
  then
    return
  fi

The check is most likely there to avoid parsing the first (header) line of "jaffa_results.csv", which starts with "sample" as defined by line 40 of compile_results.R. However, because the =~ operator matches substrings, the check also matches any line whose first token (the file name) contains the string "sample".
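To make the difference concrete, here is a minimal bash sketch comparing the two operators. The helper names is_skipped_regex and is_skipped_exact are made up for illustration and are not part of JAFFA; only the conditional itself mirrors the snippet quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of JAFFA) that report whether a first-field
# value would be skipped by each form of the check.

is_skipped_regex() {
  local field1="$1"
  # A quoted right-hand side of =~ is matched literally, anywhere in the string.
  if [[ ${field1} =~ "sample" ]]; then
    echo "skipped"
  else
    echo "processed"
  fi
}

is_skipped_exact() {
  local field1="$1"
  # == with a quoted right-hand side requires the whole string to be equal.
  if [[ ${field1} == "sample" ]]; then
    echo "skipped"
  else
    echo "processed"
  fi
}

is_skipped_regex "sample"          # header token: skipped
is_skipped_regex "sample_1.fastq"  # data row, but also skipped (the bug)
is_skipped_regex "sampl3_1.fastq"  # processed
is_skipped_exact "sample_1.fastq"  # processed
```

With the substring match, a data row for "sample_1.fastq" is treated like the header, which would explain why no sequences are written.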

Besides the simplest fix of changing the =~ operator to ==, a more robust solution might be to change the conditional to check the second token instead:

  # NOTE must match the header as defined in compile_results.R
  if [[ ${field2} == "fusion" ]]
  then
    return
  fi

The logic behind the proposed change is this: the first token can be matched by accident because it holds the file/sample name, whereas the second token is safer. It is "fusion" in the header line, while in any subsequent line it consists of two fused gene names and should reasonably never be "fusion".
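A small sketch of how the second-token check would behave, under stated assumptions: process_row is a hypothetical stand-in for the relevant part of get_fusion_seqs(), and the rows passed to it are invented placeholders, not real JAFFA output.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the header check; not the real get_fusion_seqs().
process_row() {
  local field1="$1" field2="$2"
  # Skip only the header row, identified by its second token.
  if [[ ${field2} == "fusion" ]]; then
    return
  fi
  echo "processing ${field1} ${field2}"
}

# Invented rows: a header-like row, then a data row whose
# sample name happens to contain "sample".
process_row "sample" "fusion"
process_row "sample_1.fastq" "GENEA:GENEB"
```

Only the header-like row is skipped here; the data row is processed even though its first token contains "sample".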

Thank you,

Roberto

Answer 1 · 2022-02-07T07:17:41.000Z

Hi Roberto,

Thank you very much for not only reporting this issue, but also finding the cause and suggesting a fix! I will add this to the next version of our code.

Cheers,
Nadia.

Answer 2 · 2022-02-07T08:54:10.000Z

Thank you, and thanks for the wonderful software!

Answer 3 · 2022-02-25T02:45:47.000Z

Fixed in commit 6bcad6a, along with #68 and #72.