ConesaLab/SQANTI3

FileNotFoundError: [Errno 2] No such file or directory: 'sqanti 3_output_splits/0/sqanti3_output_corrected.faa'

lscdanson opened this issue · 18 comments

Is there an existing issue for this?

  • I have searched the existing issues

Have you loaded the SQANTI3.env conda environment?

  • I have loaded the SQANTI3.env conda environment

Problem description

Hi, the error in the title was returned when I ran sqanti3_qc.py. I checked the sqanti 3_output_splits/0/ directory and a FASTA file exists there ("sqanti3_output_corrected.fasta"), so I'm not sure whether the recent release has changed the file suffix (or whether it's actually a different file). Please also find my command for your reference:

`#!/bin/sh

#SBATCH --time=2-00:00:00
#SBATCH --ntasks=30
#SBATCH --mem=256G
#SBATCH --partition=long

python /ceph/project/cribbslab/shared/proj048/analyses/SQANTI3-5.2.1/sqanti3_qc.py
--fasta /ceph/project/cribbslab/shared/proj048/analyses/PBMC_Sample4_unstim_R1/BLAZE/FLAMES/output/transcript_assembly.fa
/ceph/project/cribbslab/shared/proj048/analyses/PBMC_Sample4_unstim_R1/BLAZE/FLAMES/gencode.v45.chr_patch_hapl_scaff.annotation.gtf
/ceph/project/cribbslab/shared/proj048/analyses/PBMC_Sample4_unstim_R1/BLAZE/FLAMES/GRCh38.p14.genome.fa
--force_id_ignore --cpus 30 --report both -d /ceph/project/cribbslab/shared/proj048/analyses/PBMC_Sample4_unstim_R1/BLAZE/FLAMES/SQANTI3
-o sqanti3_output -n 10`

Many thanks!

Code sample

No response

Error

No response

Anything else?

No response

slurm-1338135.txt
Please also find my output log file.

Hi @lscdanson,

It looks like you are missing a library ("NameError: name 'BioReaders' is not defined"), and that's why the sqanti3_output_splits/0/sqanti3_output_corrected.faa file is not being generated. Can you try installing BioReaders and running SQANTI3 again?

It looks like there's something going on with BioReaders, since it should come from cDNA_cupcake. @TianYuan-Liu can you take a look at this?

Hi Carol, Alejandro and I discussed this issue earlier today. Could you please send us the file so we can replicate the issue? Thanks! @lscdanson

Hi @lscdanson ,

There was an issue importing the class BioReaders, thanks for letting us know. This is fixed in the last commit 6345adf.

Alejandro

Hi I still have the same error after re-installing SQ3. I tried to look for a package named BioReaders to install but couldn't find it. I reckon this is probably a class created by cDNA_Cupcake, so I tried installing cDNA_Cupcake but I ran into a compilation error as others described: https://github.com/Magdoll/cDNA_Cupcake/issues/241 (P.S. cDNA_Cupcake has been archived). Is there an easier solution to it?

Here is my cDNA_Cupcake installation error in full:
(SQANTI3.env) dloi@imm-login1:/ceph/project/cribbslab/shared/proj048/analyses/cDNA_Cupcake$ python setup.py build
Compiling cupcake/tofu/branch/intersection_unique.pyx because it changed.
Compiling cupcake/tofu/branch/c_branch.pyx because it changed.
Compiling cupcake/ice/find_ECE.pyx because it changed.
[1/3] Cythonizing cupcake/ice/find_ECE.pyx
/ceph/project/cribbslab/dloi/miniforge3/envs/SQANTI3.env/lib/python3.10/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /ceph/project/cribbslab/shared/proj048/analyses/cDNA_Cupcake/cupcake/ice/find_ECE.pyx
tree = Parsing.p_module(s, pxd, full_module_name)
[2/3] Cythonizing cupcake/tofu/branch/c_branch.pyx
/ceph/project/cribbslab/dloi/miniforge3/envs/SQANTI3.env/lib/python3.10/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /ceph/project/cribbslab/shared/proj048/analyses/cDNA_Cupcake/cupcake/tofu/branch/c_branch.pyx
tree = Parsing.p_module(s, pxd, full_module_name)

Error compiling Cython file:

...
exon_tree.insert_interval(Interval(e_start+offset, i+offset, index))
index += 1
tag = False
elif baseC[i] > 0 and (altC_pos[i] > threshSplit or altC_neg[i+1] < -threshSplit): # alt. junction found!
# end the current exon at i and start a new one at i + 1
print "alt. junction found at", i
^

cupcake/tofu/branch/c_branch.pyx:30:22: Syntax error in simple statement list
Traceback (most recent call last):
File "/ceph/project/cribbslab/shared/proj048/analyses/cDNA_Cupcake/setup.py", line 25, in
ext_modules = cythonize(ext_modules),
File "/ceph/project/cribbslab/dloi/miniforge3/envs/SQANTI3.env/lib/python3.10/site-packages/Cython/Build/Dependencies.py", line 1154, in cythonize
cythonize_one(*args)
File "/ceph/project/cribbslab/dloi/miniforge3/envs/SQANTI3.env/lib/python3.10/site-packages/Cython/Build/Dependencies.py", line 1321, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: cupcake/tofu/branch/c_branch.pyx

Also not sure if it's relevant – my input data is one of the output files (transcript_assembly.fa) from FLAMES, derived from single-cell 10x data using the Oxford Nanopore platform but not Iso-seq.

Hi @lscdanson, can you check the version of SQANTI you installed? It should be at least v5.2.1

Hi I downloaded the v5.2.1 release as instructed (https://github.com/ConesaLab/SQANTI3/archive/refs/tags/v5.2.1.tar.gz) in your Dependencies and Installations page. I also checked the version and it returned v5.2:

(SQANTI3.env) dloi@imm-login1:/ceph/project/cribbslab/shared/proj048/analyses/PBMC_Sample4_unstim_R1/BLAZE/FLAMES/SQANTI3$ python /ceph/project/cribbslab/shared/proj048/analyses/SQANTI3-5.2.1/sqanti3_qc.py --version
Rscript (R) version 4.3.1 (2023-06-16)
SQANTI3 5.2

Hi @lscdanson, you should install the development version by cloning the repository. The problem with BioReaders was fixed by @alexpan00 in commit 6345adf, which has not yet been included in the latest release.

Hi @carolinamonzo it works now! Thanks for being so patient with me! The output HTML looks fantastic and very nicely made!

Awesome! No worries, we are happy to help 😉

> Hi @lscdanson, you should install the development version by cloning the repository. The problem with BioReaders was fixed by @alexpan00 in commit 6345adf, which has not yet been included in the latest release.

Is this still not fixed in the production release of SQANTI3? The docs still state to use https://github.com/ConesaLab/SQANTI3/archive/refs/tags/v5.2.1.tar.gz, and I seem to be getting the same error with v5.2.1:

Process Process-4:
Traceback (most recent call last):
  File "/home/nickyoungblut/miniforge3/envs/SQANTI3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/nickyoungblut/miniforge3/envs/SQANTI3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 1853, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 503, in correctionPlusORFpred
    err_correct(args.genome, corrSAM, corrFASTA, genome_dict=genome_dict)
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/utilities/cupcake/sequence/err_correct_w_genome.py", line 26, in err_correct
    reader = BioReaders.GMAPSAMReader(sam_file, True)
             ^^^^^^^^^^
NameError: name 'BioReaders' is not defined
Traceback (most recent call last):
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 2525, in <module>
    main()
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 2515, in main
    combine_split_runs(args, split_dirs)
  File "/home/nickyoungblut/dev/bfx/SQANTI3-5.2.1/./sqanti3_qc.py", line 2312, in combine_split_runs
    with open(_orf) as h: f_faa.write(h.read())
         ^^^^^^^^^^

The commit history shows many changes since the last release back in March. Is there a timeline for creating the new release?

I just cloned the main branch and ran sqanti3_qc.py. The script still fails, but at a different spot:

**** Predicting ORF sequences...
**** Parsing Reference Transcriptome....
**** Parsing Isoforms....
Splice Junction Coverage files not provided.
**** TSS ratio will not be calculated since SR information was not provided
**** Performing Classification of Isoforms....
Number of classified isoforms: 3181
**** RT-switching computation....
Full-length read abundance files not provided.
Isoforms expression files not provided.
**** Writing output files....
Removing temporary files....
SQANTI3 complete in 158.256359139923 sec.
Traceback (most recent call last):
  File "/home/nickyoungblut/dev/bfx/SQANTI3/./sqanti3_qc.py", line 2577, in <module>
    main()
  File "/home/nickyoungblut/dev/bfx/SQANTI3/./sqanti3_qc.py", line 2567, in main
    combine_split_runs(args, split_dirs)
  File "/home/nickyoungblut/dev/bfx/SQANTI3/./sqanti3_qc.py", line 2362, in combine_split_runs
    with open(_orf) as h: f_faa.write(h.read())
         ^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/nickyoungblut/dev/bfx/SQANTI3/sqanti3_qc_output/B16_Bl6_ENPP3KO_lung_1_splits/1/B16_Bl6_ENPP3KO_lung_1_corrected.faa'

It appears that combine_split_runs is making a lot of assumptions about what files exist.

FYI:

    f_fasta = open(corrFASTA, 'w')
    f_gtf = open(corrGTF, 'w')
        
    with open(_gtf) as h: f_gtf.write(h.read())
    with open(_fasta) as h: f_fasta.write(h.read())

    f_fasta.close()
    f_gtf.close()

...could be simplified to:

import shutil
shutil.move(_gtf, corrGTF)      # destination must be a path, not an open file object
shutil.move(_fasta, corrFASTA)

Hi @nick-youngblut ,

I am sorry you are finding so many problems running SQANTI and I appreciate your suggestions. combine_split_runs assumes that all of the files of the individual SQANTI qc run exist based on the arguments that you used to run SQANTI (basically if --skipORF was included). So, the fact that the _orf file is not there is probably due to some kind of error in the individual run. It could indeed be checked if the file exists.
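The existence check mentioned above could look roughly like the following. This is only a minimal sketch, not the actual SQANTI3 code: `append_if_exists` and `combine_parts` are hypothetical helper names, and the real `combine_split_runs` handles several file types at once.

```python
from pathlib import Path


def append_if_exists(part_path, out_handle):
    """Append one split's output to the combined file; report whether it existed."""
    part = Path(part_path)
    if not part.is_file():
        return False
    with part.open() as handle:
        out_handle.write(handle.read())
    return True


def combine_parts(part_paths, combined_path):
    """Concatenate the per-split files and return the paths that were missing."""
    missing = []
    with open(combined_path, "w") as out_handle:
        for part_path in part_paths:
            if not append_if_exists(part_path, out_handle):
                missing.append(str(part_path))
    return missing
```

Returning the list of missing files, rather than raising on the first one, would let the caller report every failed split in a single message and point the user at the individual run that went wrong.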

Regarding your shutil suggestion, I assume that shutil.move works like mv, so I would be overwriting the final file instead of appending the content of each file to it, but correct me if I am wrong.

Alejandro

Thanks @alexpan00 for the info!

> Regarding your shutil suggestion, I assume that shutil.move works like mv, so I would be overwriting the final file instead of appending the content of each file to it, but correct me if I am wrong.

You are overwriting any existing files by using 'w' instead of 'a':

    f_fasta = open(corrFASTA, 'w')
    f_gtf = open(corrGTF, 'w')

Note that shutil.move will overwrite an existing destination file.

General comment: A major downside of using Python (or R) to develop a pipeline is that one must then handle all of the potential issues that can arise when calling subprocesses (e.g., checking whether the executable is installed, checking for a non-zero exit status, and checking that the output files were generated). This is a major reason why many have switched to Nextflow or Snakemake, since such pipeline software has these features built in. Oddly, there seem to be many long-read transcriptomics pipelines that do not use standard pipeline software (e.g., FLAMES, scywalker, scNanoGPS, and SQANTI3), even though one of the "foundational" workflows in this area is a Nextflow pipeline: wf-single-cell.

Yes, but opening the final files f_fasta and f_gtf is done only once and before the loop that combines the outputs of the individual runs (_fasta and _gtf). The change you suggested should be included in that loop, so using mv instead of write would overwrite the final file instead of writing the new lines to it.
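To make the distinction concrete, here is a small sketch of the "open once, append inside the loop" pattern described above. It is an illustration under assumptions, not the SQANTI3 implementation: `concatenate` is a hypothetical name, and `shutil.copyfileobj` is swapped in for `write(h.read())` so that large files are streamed rather than read into memory.

```python
import shutil


def concatenate(part_paths, combined_path):
    """Open the combined file once, then append every part inside the loop."""
    with open(combined_path, "w") as combined:  # truncates only once, up front
        for part_path in part_paths:
            with open(part_path) as part:
                # Streams the part into the already-open handle, so the
                # content written by earlier iterations is preserved.
                shutil.copyfileobj(part, combined)
```

Calling `shutil.move(part_path, combined_path)` inside the same loop would instead replace the destination on every iteration, leaving only the last part in the combined file.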

Regarding your comment about using Nextflow, I completely agree that SQANTI3 would benefit from being a Nextflow pipeline instead of a script with some calls to subprocesses. However, the initial commit of the repo you linked is from 2 years ago, while the original SQANTI repo is from 8 years ago. I wasn't there back in the day, but I guess that Nextflow was not that popular. It could be done now, but that takes time, and we lack a person who has time to dedicate to that.

Good point on the loop. I was too focused on the other aspects of that code to see the obvious fact that you are aggregating via a loop.

> SQANTI3 would benefit from being a Nextflow pipeline

Yeah, it's very hard to take a mature codebase and turn it into a pipeline. I've done it a couple of times, and it takes many hours -- even with GitHub Copilot, ChatGPT, etc.
I'm just surprised that so many in the field of long-read transcriptomics have gone the route of creating pipelines with Python or R, which almost always leads to bugs from not checking subprocess exit codes, not checking for the correct output files from the subprocess, not checking that the subprocess executables are actually in the user's PATH, etc. Then there are features that pipeline software provides that are hard to implement in R or Python, such as distributed jobs (cloud or cluster) using various Docker/conda envs and differing compute resources among the jobs.
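For pipelines that do stay in plain Python, the three checks listed above can be gathered into one small wrapper. The following is a minimal sketch using only the standard library. `run_checked` and its `expected_outputs` parameter are hypothetical names for illustration and are not part of SQANTI3.

```python
import shutil
import subprocess
from pathlib import Path


def run_checked(cmd, expected_outputs=()):
    """Run an external command and fail loudly at the step that went wrong."""
    # 1. Is the executable actually on the user's PATH?
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"executable not found on PATH: {cmd[0]}")
    # 2. check=True raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # 3. Did the command write the files the next step depends on?
    missing = [str(p) for p in expected_outputs if not Path(p).is_file()]
    if missing:
        raise RuntimeError(f"{cmd[0]} exited 0 but did not write: {missing}")
    return result
```

With a wrapper like this, a failure surfaces at the subprocess that caused it, instead of later as a `FileNotFoundError` in a downstream step that tries to read the missing output.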