nf-core/differentialabundance

Volcano plot in report shows less genes than in 'plots/differential/*/png/volcano.png'

Closed this issue · 10 comments

Description of the bug

The volcano plots in results/report/study.html#Differential_analysis differ to the ones in results/plots/differential/*/png/volcano.png, screenshots below.

I strongly assume its caused by the gtf I used. I downloaded gtfs from NCBI for non-standard organisms (bacteria). While I acknowledge that the gtfs might not follow standards (even if its coming from a popular resource) it is worrisome that the figure differ without any warning or error.

I havent investigated the issue, but I assume its related to qbic-pipelines/rnadeseq#216. Most likely I will tackle that problem in the next days and might be able to point to the reason.

Command used and terminal output

No response

Relevant files

results/report/study.html#Differential_analysis

Image

results/plots/differential/*/png/volcano.png

Image

System information

Version of nf-core/differentialabundance 1.5.0

grst commented

Seems like a duplicate of #270

I respectfully disagree.

  • I do have an overlapping problem: colors are missing and no labels -> amending the gtf with gene_name alleviates that
  • New problem: most genes are even not listed in the report.

So about the latter:
head of tables/differential/non-targeting_vs_control.deseq2.results_filtered.tsv -> 3 upregulated genes (highlighted)
Image
the plot in plots/differential/non-targeting_vs_control/png/volcano.png -> 3 upregulated genes (all good!)
Image
there is no table for Non-Targeting versus Control in Treatment in report/study.html, instead:

No significantly differential ‘up’ genes.
No significantly differential ‘down’ genes.

and additionally the volcono plot shows no significant genes:
Image

Any idea? Did I miss that problem in the other issue? If so, sorry for opening it again.

Edit - my struggle:

  • tables/annotation/*.anno.tsv actually contains too less rows, so if the tables are merged a lot of genes might be lost!
  • --features_gtf_feature_type gene retrieves much more, and the volcano plot seems fine now
  • indeed, tables/differential/non-targeting_vs_control.deseq2.results_filtered.tsv & plots/differential/non-targeting_vs_control/png/volcano.png (and other tables) agree now!

Ok, so its finally working using --features_gtf_feature_type gene. While this is a perfectly good solution, would it be possible (1) to make a big warning that something is wrong when the differential abundance table is getting smaller or (2) maybe even be more tolerant and merge while keeping all features from the un-annotated table?

Hello @d4straub! I need some more details to better understand your problem.
Even though your labels (differential status = TRUE/FALSE) were resolved, I could notice that the log2FoldChange thresholds are not consistent, the first volcano plot has log2Foldchange > 1 and the second one has log2FoldChange > 2. We may check that.

But more important, what did you mean by:
(2) maybe even be more tolerant and merge while keeping all features from the un-annotated table

About the warning, there's no specific number to say that there are few differentially expressed genes, that really depends on the experiment

Hi @alanmmobbs93 ,

first volcano plot has log2Foldchange > 1 and the second one has log2FoldChange > 2

true, at least the legend says so... I didnt spot that

But more important, what did you mean by:
(2) maybe even be more tolerant and merge while keeping all features from the un-annotated table

I assume (I didnt check any code) that the two tables (DESeq2's statistics and annotation table) are merged with settings that drop features that are not present in the annotation table and are therefore lost. If that would be the case, just changing the merging conditions to keep all rows in the statistics table might fix the issue.

About the warning, there's no specific number to say that there are few differentially expressed genes, that really depends on the experiment

Indeed, the number of significant genes depends on the experiments, but it should not differ for comparisons within a pipeline run in different tables/figures. I meant to check the number of (significant or just total) features in the DESeq2 statistics versus the number of features plotted. If thats not the same, something is off and a warning/error is warranted I think.
By the way, while doing the plots myself (because how can I trust them now?) I also figured out that any features/genes with padj = NA in results/tables/differential/*.deseq2.results.tsv are not displayed in the plot. Not sure if thats a feature or a bug, but it might make my proposed solution above not as simple as I thought.

Hello @d4straub again! Thanks for the long explanation, it's clear now

I also figured out that any features/genes with padj = NA in results/tables/differential/*.deseq2.results.tsv are not displayed in the plot

This makes sense because you have no value to plot. Maybe adding a warning or caption to print how many NA features are would help. At least in my own experience, this is informative.

just changing the merging conditions to keep all rows in the statistics table might fix the issue

If the feature is not present in the annotation table (I think you refer to the tsv created after the reference GTF), I don't think why it would enter into DESeq module. In addition, some rows may be filtered out because of low expression before getting into DESeq2, so they wouldn't be there neither

I'll work on this, in the meantime, just let me know if you have any new comment

Thanks for looking into this!

(I think you refer to the tsv created after the reference GTF)

Apologies for being ambiguous, indeed, I refer to results/tables/annotation/*.anno.tsv that is generated based on the GTF.

If the feature is not present in the annotation table (I think you refer to the tsv created after the reference GTF), I don't think why it would enter into DESeq module.

I dont really get that. Filtering is fine, but why would features be omitted from the count table based on missing values in the GTF? I understood and welcome that the metadata (via --input) leads to filtering columns in the count table (via --matrix). Is the idea that the GTF also is filtering the matrix file? If so, that actually seems to make sense, but when a GTF doesnt follow what the pipeline expects (which happens all the time for me - bacterial annotations seem to go against the assumptions made in this pipeline, even if retrieved from reliable resources such as NCBI) then this leads to unreliable results visualization in the report which makes it unusable unfortunately.

Is the idea that the GTF also is filtering the matrix file?

@d4straub The counts table is not filtered by the GTF. However, I'm not an expert on bacteria and with my previous message I was wondering why you would have a counts table with genes that are not present in your reference GTF, so I can better understand your context.

But in order to be clear, your problem is that you got fewer genes on the volcano plots and DESeq2 output tables than on your input, am I right? I got a little confused on which exactly the issue is.
Please, if you can, provide the execution code and some example files so I can reproduce it. How many genes are there in your reference GTF and how many are there on the anno.tsv file. So, if there's and issue with the NCBI bacteria GTFs we can work on that.

But in order to be clear, your problem is that you got fewer genes on the volcano plots and DESeq2 output tables than on your input, am I right? I got a little confused on which exactly the issue is.

No.

  • the DESeq2 output tables in results/tables/differential/ are fine
  • the plots in results/plots/differential/*/png/volcano.png are fine
  • the plots and tables in results/report/study.html showing way less features than the previous tables & plots, as detailed in above

Please, if you can, provide the execution code and some example files so I can reproduce it. How many genes are there in your reference GTF and how many are there on the anno.tsv file. So, if there's and issue with the NCBI bacteria GTFs we can work on that.

Unfortunately I am not at liberty to share the data. But

  • the GTF can be found here.
  • The GTF produced a problem as in #270
  • and I fixed it with
#produce a gene_name for all genes, for displaying reasons only
zcat GCF_030316195.1_ASM3031619v1_genomic.gtf.gz | sed 's/gene "/gene_name "/' | sed '/gene_name "/!s/gene_id[^;]*;/& \0/' | sed '/gene_id.*gene_id/s/gene_id/gene_name/' | gzip > GCF_030316195.1_ASM3031619v1_genomic_genename.gtf.gz

Next I run nf-core/differentialabundance with:
NXF_VER=24.04.4 nextflow run nf-core/differentialabundance -r 1.5.0 -profile cfc --input 'metadata.csv' --matrix 'stringtie.tximport.gene.counts.tsv' --transcript_length_matrix 'stringtie.tximport.gene.length.tsv' --contrasts 'contrasts.csv' --gtf "GCF_030316195.1_ASM3031619v1_genomic_genename.gtf.gz" --outdir results

  • and ended up with too less features in plots and tables in results/report/study.html
  • appending --features_gtf_feature_type gene solved the problem. But I only saw the discrepancy because it was so brutal, otherwise I would have likely missed it. Therefore I ask for a warning/error/change of behavior.

Please do not make me detail how I had to trick nf-core/rnaseq into doing what I want, thats a nightmare.

THanks @alanmmobbs93 for fixing this :)