Input Data Requirements Not Specified by Vignette

Question

DarioS opened this issue 4 years ago · 6 comments

The vignette immediately launches into an analysis, but I would like to see some explanation of RNA-seq abundance summaries. Should counts, FPKM, TPM or something else be provided? What should definitely not be input by the user? What if the user already has an edgeR pipeline? I see the example uses DESeq2, but it's unclear what other units of measurement are valid to use. Also, Section 1 has a link to bioRxiv, but it could be updated to refer to Nature Communications instead. Log-scale microarray values and log-scale RNA-seq counts are on quite different scales. How can the same model work for both?

Answer 1 · 2021-02-12T13:56:07.000Z

To compute the scores, PROGENy performs a matrix multiplication between the PROGENy weight matrix and the user's input statistics. The PROGENy weights were estimated from linear model parameters trained on differential z-scores between perturbation and control experiments. The scores are then normalised using a gene permutation approach.

Thus, as with regular gene set enrichment approaches, any differential statistic (t-value, fold change, etc.) should work. However, the interpretation of the scores will depend on the input statistic.

In the same manner, when running PROGENy directly on a count or microarray matrix, the model should work as long as the values are normalised to follow a roughly bell-shaped distribution.

It is, however, up to the user's discretion to interpret the value of a given pathway score, knowing that this score represents a normalised weighted mean of the input statistic (where the weights are the PROGENy matrix weights).
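
The computation described above (a weighted sum of the input statistic per pathway, followed by gene-permutation normalisation) can be sketched roughly as follows. This is only an illustration of the idea, not the package's actual code: the gene names, pathway names, weights, and function names are all made up, and the real weight matrix and permutation details may differ.

```python
import random
import statistics

# Toy weight matrix: gene -> {pathway: weight}. All values are invented for illustration.
WEIGHTS = {
    "GENE_A": {"PathwayX": 2.0, "PathwayY": 0.0},
    "GENE_B": {"PathwayX": 1.0, "PathwayY": -1.5},
    "GENE_C": {"PathwayX": 0.0, "PathwayY": 3.0},
    "GENE_D": {"PathwayX": -1.0, "PathwayY": 0.5},
}


def pathway_scores(stats, weights):
    """Weighted sum of the input statistic per pathway (weights x statistics)."""
    scores = {}
    for gene, value in stats.items():
        for pathway, w in weights.get(gene, {}).items():
            scores[pathway] = scores.get(pathway, 0.0) + w * value
    return scores


def permutation_z_scores(stats, weights, n_perm=1000, seed=0):
    """Z-score each observed score against scores obtained after shuffling gene labels."""
    rng = random.Random(seed)
    observed = pathway_scores(stats, weights)
    genes = list(stats)
    values = [stats[g] for g in genes]
    null = {p: [] for p in observed}
    for _ in range(n_perm):
        shuffled = values[:]
        rng.shuffle(shuffled)
        perm = pathway_scores(dict(zip(genes, shuffled)), weights)
        for p in null:
            null[p].append(perm.get(p, 0.0))
    z = {}
    for p, obs in observed.items():
        sd = statistics.pstdev(null[p])
        z[p] = (obs - statistics.mean(null[p])) / sd if sd > 0 else 0.0
    return z


# Hypothetical differential statistic (e.g. t-values) for four genes.
t_values = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": -2.0, "GENE_D": 0.5}
raw = pathway_scores(t_values, WEIGHTS)        # {'PathwayX': 6.5, 'PathwayY': -7.25}
z = permutation_z_scores(t_values, WEIGHTS)    # sign and size relative to the shuffled null
```

The sign of a normalised score here only says whether the weighted sum is higher or lower than expected when the same values are assigned to genes at random, which is why its interpretation depends on the input statistic.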

Answer 2 · 2021-02-12T23:00:02.000Z

I wonder if you are aware of:

Transcript Length Bias in RNA-seq Data Confounds Systems Biology (2009) Biology Direct
Gene Ontology Analysis for RNA-seq: Accounting for Selection Bias (2010) Genome Biology

Longer genes will have systematically higher counts because the read length is constant. I don't think that the instructions are suitable for RNA-seq. Counts should be advised against in favour of FPKM or TPM, which adjust for gene length.
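
To make the length effect concrete, here is a minimal sketch of the standard counts-to-TPM conversion. The gene names, counts, and lengths are invented, and `counts_to_tpm` is a hypothetical helper, not a function from any package discussed here.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM: divide by length in kb, then scale to sum to 1e6."""
    rates = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {g: r / total * 1e6 for g, r in rates.items()}


# Two made-up genes expressed at the same level per kilobase:
# the long gene collects 10x the reads only because it is 10x longer.
counts = {"SHORT": 100, "LONG": 1000}
lengths = {"SHORT": 1000, "LONG": 10000}

tpm = counts_to_tpm(counts, lengths)
# Raw counts differ tenfold, but both genes get the same TPM (500000.0 each).
```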

Answer 3 · 2021-02-14T08:18:41.000Z

Thank you for pointing this out. This is indeed something we had to consider. This bias can be a problem when pathway scores are interpreted at face value within a single sample. We will make this clearer.

Answer 4 · 2021-02-15T13:33:42.000Z

Hi all,

Also see #7 for a discussion I had with Michael a while back on this. I would also lean towards normalizing by gene length, or doing both and checking whether the results are consistent.

Kind regards,
Clemens
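
One simple way to check whether the two inputs give consistent results is to rank-correlate the pathway scores they produce. The sketch below assumes you already have one score per pathway from each input; the numbers are placeholders rather than real results, and Spearman correlation is just one reasonable choice of consistency measure.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (vx * vy) ** 0.5


# Placeholder pathway scores for the same four pathways from two different inputs.
scores_from_counts = [1.2, -0.4, 0.3, 2.1]
scores_from_length_normalised = [0.9, -0.1, 0.5, 1.8]

rho = spearman(scores_from_counts, scores_from_length_normalised)
# rho close to 1 means both inputs rank the pathways the same way.
```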

Answer 5 · 2021-02-15T22:00:07.000Z

Thanks for alerting me to your identical question a couple of years ago. This important detail should be discussed in the vignette.

Answer 6 · 2021-02-23T16:04:45.000Z

For a TCGA cohort, I've recently compared PROGENy results for RSEM_normalized, log2(RSEM_normalized), and 10^6*RSEM_scaled_estimate, which I understand is TPM. It would be helpful to have some guidance in the vignette on different inputs, and on how to think about pathway activities that are returned as positive vs. negative.