waldronlab/curatedTCGAData

Question about CNA data


Hi,

I would like to better understand what the CNA values are exactly and how they are transformed via simplifyTCGA() for a specific TCGA study. Is there documentation about these somewhere?

For example, check the following two matrices:

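library(curatedTCGAData)       # also attaches MultiAssayExperiment
library(SummarizedExperiment)  # provides assay()

cancer_data = curatedTCGAData(diseaseCode = 'PAAD', assays = '*', version = '2.0.1', dry.run = FALSE)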
cancer_data_simplified = TCGAutils::simplifyTCGA(cancer_data)

cna_snp_mat1 = t(assay(cancer_data[,,"PAAD_CNASNP-20160128"]))
cna_snp_mat2 = t(assay(cancer_data_simplified[,,"PAAD_CNASNP-20160128_simplified"]))
  • cna_snp_mat1 (before transposing: genomic regions (rows) x patient samples (columns)) - what are these values?
  • cna_snp_mat2 (before transposing: genes/others (rows) x patient samples (columns)) - how are these derived from the values above (I think the code is this one)? I am particularly interested in interpreting these values, i.e. do lower/negative values correspond to deletions and higher/positive values to amplifications?

Hi John, @bblodfon
These are Segment_Mean values and are reduced with a weightedmean function. I've updated the documentation with details. waldronlab/TCGAutils@dd53882
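
As a rough illustration of what a weighted-mean reduction does, here is a minimal base-R sketch. It assumes the weights are the widths of the overlap between each segment and the gene. The helper name `weighted_segment_mean` and the toy coordinates are hypothetical and are not part of TCGAutils.

```r
# Hypothetical helper (not the TCGAutils implementation): collapse the
# Segment_Mean values of all segments overlapping one gene into a single
# value, weighting each segment by the number of bases it shares with the gene.
weighted_segment_mean <- function(seg_start, seg_end, seg_mean,
                                  gene_start, gene_end) {
  # Overlap width in bases, using inclusive coordinates
  overlap <- pmin(seg_end, gene_end) - pmax(seg_start, gene_start) + 1
  overlap[overlap < 0] <- 0

  # Ignore segments that do not overlap the gene or have no value
  keep <- overlap > 0 & !is.na(seg_mean)
  if (!any(keep)) return(NA_real_)

  sum(seg_mean[keep] * overlap[keep]) / sum(overlap[keep])
}

# Toy data (made up): a gene at positions 100-199 covered half by a
# diploid segment (0) and half by a segment with Segment_Mean -1
seg_start <- c(1, 150)
seg_end   <- c(149, 300)
seg_mean  <- c(0, -1)

gene_value <- weighted_segment_mean(seg_start, seg_end, seg_mean, 100, 199)
print(gene_value)  # -0.5
```

In this toy case each segment covers 50 bases of the gene, so the gene-level value is the plain average of the two segment means.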

I couldn't quickly find the documentation for the Broad Firehose pipeline but I saw that
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/ has

The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). Diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values.
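
To make the interpretation concrete, the formula quoted above can be inverted to recover an approximate copy number from a segment mean. This is only a sketch under the assumption that the Firehose Segment_Mean values follow the same log2(copy-number / 2) convention as the GDC. The function name `segment_mean_to_copy_number` is made up for illustration.

```r
# Hypothetical helper: invert segment_mean = log2(copy_number / 2)
segment_mean_to_copy_number <- function(segment_mean) {
  2 * 2^segment_mean
}

# Example values (made up): a deletion, a diploid region, an amplification
segment_means <- c(-1, 0, 1)
copy_numbers  <- segment_mean_to_copy_number(segment_means)
print(copy_numbers)  # 1 2 4
```

So a value of -1 corresponds to roughly one copy (a single-copy loss), 0 to the diploid state, and 1 to roughly four copies.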