Question about CNA data
Hi,
I would like to better understand what the CNA values are exactly and how they are transformed via simplifyTCGA()
for a specific TCGA study. Is there documentation about these somewhere?
For example, check the following two matrices:
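library(curatedTCGAData)
library(MultiAssayExperiment)
# Download all PAAD assays, then build the simplified (gene-level) version
cancer_data = curatedTCGAData(diseaseCode = 'PAAD', assays = '*', version = '2.0.1', dry.run = FALSE)
cancer_data_simplified = TCGAutils::simplifyTCGA(cancer_data)
# Extract the CNA SNP assay before and after simplification
cna_snp_mat1 = t(assay(cancer_data[,,"PAAD_CNASNP-20160128"]))
cna_snp_mat2 = t(assay(cancer_data_simplified[,,"PAAD_CNASNP-20160128_simplified"]))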
cna_snp_mat1 (genomic regions (rows) x patient samples (columns)) - what are these values?
cna_snp_mat2 (genes/others (rows) x patient samples (columns)) - how are these transformed from the above (I think the code is this one)?
I am particularly interested in interpreting these values, i.e. do lower/negative values correspond to deletions and higher/positive values to amplifications?
Hi John, @bblodfon
These are Segment_Mean values, and they are reduced to the simplified assay with a weightedmean function. I've updated the documentation with details: waldronlab/TCGAutils@dd53882
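To give a feel for what a weighted-mean reduction does, here is a minimal base R sketch. It is not the TCGAutils code: `weighted_segment_mean` is a hypothetical helper, the coordinates and values are made-up toy numbers, and it assumes that each segment is weighted by the number of bases it shares with the gene.

```r
# Hypothetical helper (not part of TCGAutils): overlap-width-weighted mean of
# Segment_Mean values for one gene. Coordinates are inclusive, on one chromosome.
weighted_segment_mean <- function(seg_start, seg_end, seg_mean,
                                  gene_start, gene_end) {
  # Number of bases each segment shares with the gene
  overlap <- pmin(seg_end, gene_end) - pmax(seg_start, gene_start) + 1
  overlap <- pmax(overlap, 0)  # segments that miss the gene get zero weight
  if (sum(overlap) == 0) return(NA_real_)  # no segment covers the gene
  sum(seg_mean * overlap) / sum(overlap)
}

# Toy example: a gene spanning positions 100-199 (100 bases).
# Segment 1 (1-149, mean -1) overlaps 50 bases; segment 2 (150-300, mean 0.5)
# overlaps the other 50 bases.
gene_value <- weighted_segment_mean(
  seg_start = c(1, 150), seg_end = c(149, 300), seg_mean = c(-1, 0.5),
  gene_start = 100, gene_end = 199
)
print(gene_value)  # -0.25
```

In this toy case the two segments cover the gene equally, so the gene-level value is simply the average of -1 and 0.5.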
I couldn't quickly find the documentation for the Broad Firehose pipeline, but I saw that https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/ has:
"The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). Diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values."
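If the log2(copy-number / 2) relationship quoted from the GDC page also holds for these Firehose-derived values (an assumption, since I couldn't confirm it for the Broad pipeline), a segment mean can be converted back to an approximate copy number like this. The input values are made up for illustration.

```r
# Invert segment_mean = log2(copy_number / 2), assuming that relationship holds
segment_mean <- c(-1, 0, 1)          # toy values: deletion, diploid, amplification
copy_number <- 2 * 2^segment_mean    # copy_number = 2 * 2^segment_mean
print(copy_number)  # 1 2 4
```

So a value of 0 corresponds to two copies, -1 to a single-copy loss and 1 to four copies. Real values are rarely this clean, for example because tumour samples are mixed with normal cells.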