mskcc/RNAseqDB

quantile normalization unclear, script missing

Closed this issue · 4 comments

bzip2 commented

Hello.

I am trying to follow your quantile normalization, which is mentioned in your readme:

unnormalized: the gene expression levels calculated from fpkm of RSEM’s output. The data matrices here, however, were not the direct output of RSEM. They underwent quantile normalization, but were not corrected for batch effects.

This appears to be done by the function QuantileNorm in calc-expression.pl by calling quartile_norm.pl, but this script is missing from your repo. It looks like you forgot to include this and another script because they were in another folder, so it would be great if you can add them.

In any case, the call suggests that you are setting the 75th percentile to 1000, but I don't see this in either the "unnormalized" or "normalized" files, either for a single sample or for all samples, whether I include all values or only the non-zero values. For individual samples in the unnormalized files, the 75th percentile value varies is generally between 350 and 700. Can you point me to a specific data file and describe which percentile should have what value, and/or could you update your readme so your procedure is clearer?

Thank you.

Thanks for your interest in this project. quartile_norm.pl was not developed by us, but was downloaded from UNC (University of North Carolina) Bioinformatics Core at https://github.com/mozack/ubu/blob/master/src/perl/quartile_norm.pl

We simply used the default parameters to run this script for all samples. The value 1000 here does not refer to expression level, but target to use (see quartile_norm.pl for detail).

bzip2 commented

Hello.

Thanks for pointing me to the quantile normalization code. I get the expected results if I run that script on a file of counts from your data, but your corresponding "unnormalized" results don't match. Either your description is incomplete or you have a bug. On a related note, for both TCGA and GTEx data, the first sample in the counts and "unnormalized" files ends up as the last sample in the "normalized" files, while the other samples shift by one. This suggests that your code is doing something inadvertent.

1000 does indeed refer to an expression level. It's the target 75th percentile expression level after rescaling. One detail from their code not mentioned in your readme is that the percentile is determined from values >=1.

I asked you to point to a specific example is so that I could try to understand if I was misinterpreting your results. To give you a specific example with GTEx data, although I also did this with TCGA:

  1. Run zcat expected_count/cervix-rsem-count-gtex.txt.gz > in.

  2. Run quartile_norm.pl -s 1 -c -1 -o out in. It doesn't matter that the file contains the Entrez ID since it's a number.

  3. Let's use the first sample (third column), GTEX-S32W-1626-SM-4AD6G. If you sort the input file by the third column and keep only values >=1 for that sample, you get 17845 lines and the 75th percentile value is on line 13384, which is ATXN1.

  4. Do grep 1000.0 out and you should get ATXN1 as the gene and 1000.0000 as the rescaled value for the first sample (third column).

  5. Look in unnormalized/cervix-rsem-fpkm-gtex.txt.gz. The first sample is GTEX-S32W-1626-SM-4AD6G. Grep ATXN1 in this file and its expression value for the first sample is 226.7153.

  6. Look in normalized/cervix-rsem-fpkm-gtex.txt.gz. GTEX-S32W-1626-SM-4AD6G is now the last sample.

It looks like your results might be scrambled. What am I missing?

I am not surprised at your results. No, they will never match, because 'expected_count' is the number of reads that were mapped to genes and 'unnormalized' is RPKM.

Sorry I forget to mention in my last response that we discarded the results of quantile normalization by calc-expression.pl. It is the script post-process.pl that was used by us to run quantile normalization.

I will improve the description of our documentation when I have time - too busy recently.

Hi bzip2,

  1. I agree with you that the author used upper-quartile normalization, setting 1000 to the 75-percentile and ignoring all genes with value <1.
  2. It is not possible to convert gene-level-read-count to gene-level-rpkm for genes with multiple isoforms.
  3. The values in the "unnormalized" folder are very likely raw FPKM, although the raw FPKM files are deleted after quartile-normalization according to the pipeline. They are NOT quantile-normalized and they are NOT quartile-normalized. Take bladder "blca-rsem-fpkm-tcga-t.txt.gz" for example:
  • summary(tcga[,3])
  • Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    
  •  0.0      3.6     94.4   1078.4    502.1 333638.0 
    
  • summary(tcga[,4])
  • Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    
  •  0.0      5.3    145.7   1236.4    603.5 620948.8 
    
  • summary(tcga[,5])
  • Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    
  •  0.0      0.9     61.4   1284.1    410.5 381319.3 
    
  • summary(tcga[,6])
  • Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    
  •  0.0      7.0    142.9   1203.2    578.4 623815.9