GTF_2_GFF3 fails with latest Gencode GTF for human
Closed this issue · 21 comments
Description of the bug
The pipeline fails when running with --rmats
and the latest GTF file for human for Gencode: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz, even though --gencode
is specified in the command.
GTF file source: https://www.gencodegenes.org/human/
I'm including a command below that reproduces the error. The sample sheet and contrast sheet are based on the pipeline's test
profile, but I've customized the GTF and reference genome.
Command used and terminal output
nextflow run nf-core/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --genome GRCh38 \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz --rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon
ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:GTF_2_GFF3'
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:GTF_2_GFF3` terminated with an error exit status (139)
Command executed:
gffread gencode.v43.annotation.gtf -L | awk -F' ' -vOFS=' ' '{ gsub("transcript", "mRNA", $3); print}' > gencode.v43.annotation_genes.gff3
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:GTF_2_GFF3":
gffread: $(gffread --version 2>&1)
END_VERSIONS
Command exit status:
139
Command output:
(empty)
Command error:
.command.sh: line 2: 28 Segmentation fault (core dumped) gffread gencode.v43.annotation.gtf -L
30 Done | awk -F' ' -vOFS=' ' '{ gsub("transcript", "mRNA", $3); print}' > gencode.v43.annotation_genes.gff3
Relevant files
No response
System information
No response
Hi @amizeranschi.
Thank you for reporting the issue. We will let you know as soon as possible
For the record, the referenced commits by @dkoppstein have solved my issue. Would be great to see this fix merged into the base pipeline.
Unfortunately, my test run still ended up failing, due to a different reason. This is the updated command that I ran:
nextflow run dkoppstein/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --genome GRCh38 \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz --rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon
And below is the new issue. @valentinoruggieri any help with this new error would be much appreciated.
ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (ERR188428)'
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (ERR188428)` terminated with an error exit status (105)
Command executed:
STAR \
--genomeDir STARIndex \
--readFilesIn input1/ERR188428_1_val_1.fq.gz input2/ERR188428_2_val_2.fq.gz \
--runThreadN 12 \
--outFileNamePrefix ERR188428. \
\
--sjdbGTFfile gencode.v43.annotation.gtf \
--outSAMattrRGline 'ID:ERR188428' 'SM:ERR188428' \
--quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
if [ -f ERR188428.Unmapped.out.mate1 ]; then
mv ERR188428.Unmapped.out.mate1 ERR188428.unmapped_1.fastq
gzip ERR188428.unmapped_1.fastq
fi
if [ -f ERR188428.Unmapped.out.mate2 ]; then
mv ERR188428.Unmapped.out.mate2 ERR188428.unmapped_2.fastq
gzip ERR188428.unmapped_2.fastq
fi
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN":
star: $(STAR --version | sed -e "s/STAR_//g")
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
gawk: $(echo $(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*$//')
END_VERSIONS
Command exit status:
105
Command output:
STAR --genomeDir STARIndex --readFilesIn input1/ERR188428_1_val_1.fq.gz input2/ERR188428_2_val_2.fq.gz --runThreadN 12 --outFileNamePrefix ERR188428. --sjdbGTFfile gencode.v43.annotation.gtf --outSAMattrRGline ID:ERR188428 SM:ERR188428 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Jul 18 09:03:48 ..... started STAR run
Jul 18 09:03:48 ..... loading genome
Command error:
STAR --genomeDir STARIndex --readFilesIn input1/ERR188428_1_val_1.fq.gz input2/ERR188428_2_val_2.fq.gz --runThreadN 12 --outFileNamePrefix ERR188428. --sjdbGTFfile gencode.v43.annotation.gtf --outSAMattrRGline ID:ERR188428 SM:ERR188428 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Jul 18 09:03:48 ..... started STAR run
Jul 18 09:03:48 ..... loading genome
EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.9a
SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.4a
Jul 18 09:03:48 ...... FATAL ERROR, exiting
@amizeranschi thanks for the confirmation. I don't have access to a linux box with Docker to run the formal tests, so am waiting on opening a PR, if you or one of the devs can test run -profile test,docker
on my branch it would be great.
However, I am also encountering another issue downstream at the index_gff step with a Gencode GTF... will try tinkering around a bit, then may need to submit another bug report.
For your STAR issue in the short term, it seems like the AWS iGenomes are out of date -- I would suggest using the latest ENSEMBL or GENCODE fasta and gtf files, and specifying them on the command line.
OK, I will give this a try. Either way, I guess the STAR index will have to be rebuilt, correct? I will try using the latest FASTA and GTF from Gencode, I just saw that they recently made a new release, v. 44.
OK, then I'll add --save_reference to my command so that I can later reuse the new STAR index. Thanks a lot for your help.
@amizeranschi @dkoppstein Thank you both for reviewing and testing the code.
As you mentioned above, updating gffread from 0.12.1 to 0.12.7 solves the issue in our test too. In addition, we are evaluating to add the option "--keep-gene" in the gffread script (e.g. gffread $gtf -L --keep-genes ) since it seems that "genes" are not automatically added in the gff3 file, and this might cause problems in some downstream processes.
Regarding the STAR issue, as you pointed out it depends on the STAR version (2.5.1b 2016) used to generate the index in the AWS iGenome and the STAR version used for the mapping (version 2.7.9a). We are looking for a way to take into consideration this discrepancy.
However, passing the genome file with the option '--fasta' skips the issue since the index is rebuilt using the right STAR version.
OK, changing the command to the one below got things moving forward for me:
nextflow run dkoppstein/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --igenomes_ignore --genome null --save_reference \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--fasta https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
--rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon
@dkoppstein I am now getting some GFF-related warnings and errors, but I'm not sure if this is what you were referring to earlier. This is what I get:
ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)'
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)` terminated with an error exit status (1)
Command executed:
miso --run index ERR188428_sorted.bam --output-dir miso_data/ERR188428 --read-len 75
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN":
misopy: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('misopy').version)")
END_VERSIONS
Command exit status:
1
Command output:
Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
Computing Psi values...
- GFF index: index
- BAM: ERR188428_sorted.bam
- Read length: 75
- Output directory: miso_data/ERR188428
Checking your GFF annotation and BAM for mismatches...
Checking if BAM has mixed read lengths...
07/18/2023 11:25:14 AM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188428_sorted.bam
MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
http://genes.mit.edu/burgelab/miso/docs/
for more information.
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
BAM references:
('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were:
07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
Mapping genes to their indexed GFF representation, using index
Searching for index/genes_to_filenames.shelve..
- File not found.
Skipping: /home/ubuntu/work/dc/eeb28063f47f2t support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
http://genes.mit.edu/burgelab/miso/docs/
for more information.
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
BAM references:
('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were:
07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
Mapping genes to their indexed GFF representation, using index
Searching for index/genes_to_filenames.shelve..
- File not found.
Skipping: index/genes.gff
Skipping: index/compressed_ids_to_genes.shelve.bak
Skipping: index/genes_to_filenames.shelve.bak
Skipping: index/genes_to_filenames.shelve.dat
Skipping: index/compressed_ids_to_genes.shelve.dir
Skipping: index/genes_to_filenames.shelve.dir
Skipping: index/compressed_ids_to_genes.shelve.dat
07/18/2023 11:25:34 AM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?
Command error:
Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
Computing Psi values...
- GFF index: index
- BAM: ERR188428_sorted.bam
- Read length: 75
- Output directory: miso_data/ERR188428
Checking your GFF annotation and BAM for mismatches...
Checking if BAM has mixed read lengths...
07/18/2023 11:25:14 AM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188428_sorted.bam
MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
http://genes.mit.edu/burgelab/miso/docs/
for more information.
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
BAM references:
('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were:
07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
Mapping genes to their indexed GFF representation, using index
Searching for index/genes_to_filenames.shelve..
- File not found.
Skipping: /home/ubuntu/work/dc/eeb28063f47f2t support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your GFF annotation file and your BAM file might not have matching headers (chromosome names.) If this is the case, your run will fail as no reads from the BAM could be matched up with your annotation.
07/18/2023 11:25:19 AM - miso_main - WARNING - Please see:
http://genes.mit.edu/burgelab/miso/docs/
for more information.
07/18/2023 11:25:19 AM - miso_main - WARNING - It looks like your BAM file (ERR188428_sorted.bam) contains 'chr' chromosomes (UCSC-style) while your GFF annotation (index) does not.
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few BAM chromosomes were: chr7,chr6,chr5,chr4,chr3,chr2,chr1,chr9,chr8,chr12,chr11,chr10
BAM references:
('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM', 'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1', 'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000221.1', 'GL000224.1', 'GL000225.1', 'GL000226.1', 'KI270302.1', 'KI270303.1', 'KI270304.1', 'KI270305.1', 'KI270310.1', 'KI270311.1', 'KI270312.1', 'KI270315.1', 'KI270316.1', 'KI270317.1', 'KI270320.1', 'KI270322.1', 'KI270329.1', 'KI270330.1', 'KI270333.1', 'KI270334.1', 'KI270335.1', 'KI270336.1', 'KI270337.1', 'KI270338.1', 'KI270340.1', 'KI270362.1', 'KI270363.1', 'KI270364.1', 'KI270366.1', 'KI270371.1', 'KI270372.1', 'KI270373.1', 'KI270374.1', 'KI270375.1', 'KI270376.1', 'KI270378.1', 'KI270379.1', 'KI270381.1', 'KI270382.1', 'KI270383.1', 'KI270384.1', 'KI270385.1', 'KI270386.1', 'KI270387.1', 'KI270388.1', 'KI270389.1', 'KI270390.1', 'KI270391.1', 'KI270392.1', 'KI270393.1', 'KI270394.1', 'KI270395.1', 'KI270396.1', 'KI270411.1', 'KI270412.1', 'KI270414.1', 'KI270417.1', 'KI270418.1', 'KI270419.1', 'KI270420.1', 'KI270422.1', 'KI270423.1', 'KI270424.1', 'KI270425.1', 'KI270429.1', 'KI270435.1', 'KI270438.1', 'KI270442.1', 'KI270448.1', 'KI270465.1', 'KI270466.1', 'KI270467.1', 'KI270468.1', 'KI270507.1', 'KI270508.1', 'KI270509.1', 'KI270510.1', 'KI270511.1', 'KI270512.1', 'KI270515.1', 'KI270516.1', 'KI270517.1', 'KI270518.1', 'KI270519.1', 'KI270521.1', 'KI270522.1', 'KI270528.1', 'KI270529.1', 'KI270530.1', 'KI270538.1', 'KI270539.1', 'KI270544.1', 'KI270548.1', 'KI270579.1', 'KI270580.1', 'KI270581.1', 'KI270582.1', 'KI270583.1', 'KI270584.1', 'KI270587.1', 'KI270588.1', 'KI270589.1', 'KI270590.1', 'KI270591.1', 'KI270593.1', 'KI270706.1', 'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1', 'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1', 'KI270715.1', 'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1', 'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1', 'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1', 'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1', 'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1', 'KI270740.1', 'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270744.1', 'KI270745.1', 'KI270746.1', 'KI270747.1', 'KI270748.1', 'KI270749.1', 'KI270750.1', 'KI270751.1', 'KI270752.1', 'KI270753.1', 'KI270754.1', 'KI270755.1', 'KI270756.1', 'KI270757.1')
07/18/2023 11:25:19 AM - miso_main - WARNING - The first few GFF chromosomes were:
07/18/2023 11:25:19 AM - miso_main - WARNING - Run is likely to fail or produce empty output. Proceeding anyway...
Mapping genes to their indexed GFF representation, using index
Searching for index/genes_to_filenames.shelve..
- File not found.
Skipping: index/genes.gff
Skipping: index/compressed_ids_to_genes.shelve.bak
Skipping: index/genes_to_filenames.shelve.bak
Skipping: index/genes_to_filenames.shelve.dat
Skipping: index/compressed_ids_to_genes.shelve.dir
Skipping: index/genes_to_filenames.shelve.dir
Skipping: index/compressed_ids_to_genes.shelve.dat
07/18/2023 11:25:34 AM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?
adding the option "--keep-gene" in the gffread script of the GTF_2_GFF3 module ( gffread $gtf -L --keep-genes | awk -F'\t' -vOFS='\t' '{ gsub("transcript", "mRNA", \$3); print}' > ${gtf.baseName}_genes.gff3
) should solve the issue
@dkoppstein I also ran a built-in test earlier, as you suggested:
nextflow run dkoppstein/rnasplice -r dev -latest -profile test,docker
and the outcome was similar to my test above. Would you be able to implement the fix suggested by @valentinoruggieri?
ERROR ~ Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)'
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN (1)` terminated with an error exit status (1)
Command executed:
miso --run index ERR188383_sorted.bam --output-dir miso_data/ERR188383 --read-len 75
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_RUN":
misopy: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('misopy').version)")
END_VERSIONS
Command exit status:
1
Command output:
MISO (Mixture of Isoforms model)
Probabilistic analysis of RNA-Seq data for detecting differential isoforms
Use --help argument to view options.
Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
Computing Psi values...
- GFF index: index
- BAM: ERR188383_sorted.bam
- Read length: 75
- Output directory: miso_data/ERR188383
Checking your GFF annotation and BAM for mismatches...
Checking if BAM has mixed read lengths...
07/18/2023 12:11:06 PM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188383_sorted.bam
MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
Mapping genes to their indexed GFF representation, using index
Searching for index/genes_to_filenames.shelve..
- File not found.
Skipping: index/genes.gff
Skipping: index/compressed_ids_to_genes.shelve.bak
Skipping: index/genes_to_filenames.shelve.bak
Skipping: index/genes_to_filenames.shelve.dat
Skipping: index/compressed_ids_to_genes.shelve.dir
Skipping: index/genes_to_filenames.shelve.dir
Skipping: index/compressed_ids_to_genes.shelve.dat
07/18/2023 12:11:11 PM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?
Command error:
MISO (Mixture of Isoforms model)
Probabilistic analysis of RNA-Seq data for detecting differential isoforms
Use --help argument to view options.
Using MISO settings file: /usr/local/lib/python2.7/site-packages/misopy/settings/miso_settings.txt
Computing Psi values...
- GFF index: index
- BAM: ERR188383_sorted.bam
- Read length: 75
- Output directory: miso_data/ERR188383
Checking your GFF annotation and BAM for mismatches...
Checking if BAM has mixed read lengths...
07/18/2023 12:11:06 PM - miso_main - WARNING - Found mixed length reads in your BAM file: ERR188383_sorted.bam
MISO does not support mixed read lengths. Please trim your reads to a uniform length and re-run or run MISO separately on each read length. Proceeding anyway, though this may result in errors or inability to match reads to your annotation.
Read lengths were: 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
Mapping genes to their indexed GFF representation, using index
Searching for index/genes_to_filenames.shelve..
- File not found.
Skipping: index/genes.gff
Skipping: index/compressed_ids_to_genes.shelve.bak
Skipping: index/genes_to_filenames.shelve.bak
Skipping: index/genes_to_filenames.shelve.dat
Skipping: index/compressed_ids_to_genes.shelve.dir
Skipping: index/genes_to_filenames.shelve.dir
Skipping: index/compressed_ids_to_genes.shelve.dat
07/18/2023 12:11:11 PM - miso_main - ERROR - No genes to run on. Did you pass me the wrong path to your index GFF directory? Or perhaps your indexed GFF directory is empty?
@amizeranschi I just pushed the suggested fix to my fork.
Thanks, it seems like that did the trick. The built-in test ran fine now. I'll also run my own test above and see how that turns out.
Unfortunately, my test still ran into an error. Reminder, this is the command I'm running:
nextflow run dkoppstein/rnasplice -r dev -latest -profile docker --outdir test-small-rnasplice --gencode --igenomes_ignore --genome null --save_reference \
--input https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/samplesheet.csv \
--contrasts https://raw.githubusercontent.com/nf-core/test-datasets/rnasplice/samplesheet/contrastsheet.csv \
--fasta https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
--rmats --rmats_read_len 75 --rmats_paired_stats --aligner star_salmon --dexseq_exon
And this was the outcome, with the latest version of @dkoppstein's fork. Any idea about this error, @valentinoruggieri?
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_SASHIMI (1)` terminated with an error exit status (1)
Command executed:
sashimi_plot --plot-event ENSG00000004961 index miso_settings.txt --output-dir sashimi
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:VISUALISE_MISO:MISO_SASHIMI":
python: $(python --version | sed "s/Python //g")
misopy: $(python -c "import pkg_resources; print(pkg_resources.get_distribution('misopy').version)")
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
/usr/local/lib/python2.7/site-packages/matplotlib/cbook/deprecation.py:107: MatplotlibDeprecationWarning: The mpl_toolkits.axes_grid module was deprecated in version 2.1. Use mpl_toolkits.axes_grid1 and mpl_toolkits.axisartist provies the same functionality instead.
warnings.warn(message, mplDeprecation, stacklevel=1)
Traceback (most recent call last):
File "/usr/local/bin/sashimi_plot", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/misopy/sashimi_plot/sashimi_plot.py", line 276, in main
plot_label=plot_label)
File "/usr/local/lib/python2.7/site-packages/misopy/sashimi_plot/sashimi_plot.py", line 142, in plot_event
%(event_name, pickle_dir)
Exception: Event ENSG00000004961 not found in pickled directory index. Are you sure this is the right directory for the event?
@amizeranschi it seems the error arises because the "miso_genes" IDs in nextflow.config are slightly different from those of the genome version you used (e.g. ENSG00000004961 vs ENSG00000004961.15) . You should change them accordingly.
I'm afraid I don't fully follow you. Could you please be a bit more specific? Where and what should I change in regards to those "miso_genes" IDs?
rnasplice relies on "miso_genes" params to specify the gene IDs you want to plot via miso/sashimi_plot function. By default, it includes 3 genes ('ENSG00000004961, ENSG00000005302, ENSG00000147403') (check the nextflow.config file).
Users need to specify the gene IDs they want to plot accordingly to the genome annotation they are using.
For example, in the gencode version you are using (version 44), genes ID are coded with an extra suffix (.*) like "ENSG00000004961.15". So to correctly instruct miso, you need to use a list of gene IDs that corresponds to that version.
You can do that:
- by modifying the gene IDs in the nextflow.config file
(for example replacing the default values with 'ENSG00000004961.15, ENSG00000005302.19, ENSG00000147403.18'); - by adding an argument to your command line
--miso_genes 'ENSG00000004961.15'
;
[https://nf-co.re/rnasplice/dev/parameters] for more details.
I see, thanks a lot for the added details. I have added the following to my command line:
--miso_genes "ENSG00000004961.15, ENSG00000005302.19, ENSG00000147403.18"
and everything worked perfectly.
It might also be helpful to update the documentation to reflect the actual default value for --miso_genes
from nextflow.config
. The website currently lists only two genes: ENSG00000004961, ENSG00000005302
:
I would also suggest changing these values automatically if --gencode
is specified.
Good point @dkoppstein. I opened a separate issue report here: #78
@dkoppstein could you please create a PR with your fixes?