Oshlack/JAFFA

Fastq file names containing the string "sample" cause the pipeline to fail

bioruffo opened this issue · 3 comments

Hello,
I was running JAFFA with a dummy file named just "sample_1.fastq.gz", and the pipeline was failing at the last step, like this:

==================================== Stage compile_all_results =====================================
R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> options(echo=F)
Compiling the results from:
sample_1.fastq
Done writing output jaffa_results.csv
Done writing jaffa_results.fasta
All Done.
***********************************************************************
 Citation:
   Davidson, N.M., Majewski, I.J. & Oshlack, A.
   JAFFA: High sensitivity transcriptome-focused fusion gene detection.
   Genome Med 7, 43 (2015)
***********************************************************************
ERROR: Expected output file jaffa_results.fasta could not be found


========================================= Pipeline Failed ==========================================

Expected output file jaffa_results.fasta could not be found

Use 'bpipe errors' to see output from failed commands.

Indeed, "jaffa_results.fasta" is never generated when the file name contains the string "sample".
If I rename the same file to "sampl3_1.fastq.gz", however, the pipeline succeeds.

Apparently, this is caused by the function get_fusion_seqs() in scripts/get_fusion_seqs.bash. On line 36, it will return with no output if the first token in the line of "jaffa_results.csv" being processed contains the string "sample":

  if [[ ${field1} =~ "sample" ]]
  then
    return
  fi

This check is most likely meant to skip the first (header) line of "jaffa_results.csv", which starts with "sample" as defined by line 40 of compile_results.R. However, the =~ operator tests whether the pattern occurs anywhere in the string rather than testing for equality, so it matches any line whose first token (the file name) contains the string "sample".
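To make the difference concrete, here is a minimal sketch comparing the two operators on a few first-token values. The helper names is_header_substring and is_header_exact are hypothetical and do not exist in JAFFA; they only isolate the comparison itself.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of JAFFA: they isolate the two comparison styles.

# Succeeds when the argument contains "sample" anywhere (the current behaviour).
is_header_substring() {
  local field1="$1"
  [[ ${field1} =~ "sample" ]]
}

# Succeeds only when the argument is exactly "sample".
is_header_exact() {
  local field1="$1"
  [[ ${field1} == "sample" ]]
}

# Show which first tokens each test would treat as the header and skip.
for name in sample sample_1.fastq sampl3_1.fastq; do
  if is_header_substring "${name}"; then sub=skipped; else sub=processed; fi
  if is_header_exact "${name}"; then exact=skipped; else exact=processed; fi
  echo "${name}: =~ ${sub}, == ${exact}"
done
```

With =~, a first token such as "sample_1.fastq" is treated like the header and skipped, which would explain why no FASTA output is produced for it; with ==, only the literal header token is skipped.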

Besides the simplest fix of changing the =~ operator to ==, perhaps a stronger solution would be to alter this conditional to check for the second token:

  # NOTE must match the header as defined in compile_results.R
  if [[ ${field2} == "fusion" ]]
  then
    return
  fi

The logic behind the proposed change is that the first token can match by accident, because it holds the file/sample name, whereas the second token is safer: it is "fusion" in the header line, while in every subsequent line it consists of two fused gene names and should reasonably never be "fusion".
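As a rough, self-contained illustration of this idea, the sketch below skips the header by looking at the second field only. The function filter_rows, the comma-separated layout, and the sample rows (including the gene names) are invented for the example; the real columns are whatever compile_results.R writes, and the real parsing in get_fusion_seqs.bash may split lines differently.

```shell
#!/usr/bin/env bash
# Sketch only: filter_rows and the sample data are hypothetical, not JAFFA code.

# Prints every non-header row; the header is recognised by its second field.
filter_rows() {
  local field1 field2 rest
  while IFS=',' read -r field1 field2 rest; do
    # NOTE must match the header as defined in compile_results.R
    if [[ ${field2} == "fusion" ]]; then
      continue
    fi
    echo "${field1} ${field2}"
  done
}

# Invented input: one header line and one data line whose sample name
# contains the string "sample".
printf '%s\n' \
  'sample,fusion,chrom1' \
  'sample_1.fastq,GENEA:GENEB,chr1' \
  | filter_rows
```

In this toy input the data row is kept even though its first field contains "sample", because the decision no longer depends on the file name.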

Thank you,

Roberto

Hi Roberto,

Thank you very much for not only reporting this issue, but also finding the cause and suggesting a fix! I will add this to the next version of our code.

Cheers,
Nadia.

Thank you, and thanks for the wonderful software!

Fixed in commit 6bcad6a, along with #68 and #72.