Matching pancancer expression to metadata
Hi @gwaygenomics @cgreene
I would like to map samples (index values) in pancan_scaled_rnaseq.tsv.gz to the metadata in tcga-clinical_data.tsv.
Currently the index values for the rnaseq are not unique and I am unable to match them to the metadata.
Could you please advise on how I could, for example, subset the pancancer data to a single cancer subtype by tying it to the metadata?
pancan_scaled_rnaseq.tsv.gz includes sample-level information, while tcga-clinical_data.tsv includes patient-level information. The sample-level identifiers are much more descriptive than the patient-level identifiers.
Mapping between the two files can be done by subsetting TCGA barcodes. Info here: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
This file might also be helpful: https://github.com/greenelab/pancancer/blob/master/data/sample_freeze.tsv
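A minimal sketch of the barcode-subsetting idea in Python with pandas. It relies on the TCGA barcode layout described at the GDC link above, where the first three dash-separated fields (for example TCGA-02-0001) identify the patient. The toy tables, the gene column, and the column names patient_id and disease are assumptions for illustration only; the real files may use different names.

```python
import pandas as pd


def patient_barcode(sample_barcode: str) -> str:
    """Return the patient-level part of a TCGA barcode (first three fields)."""
    return "-".join(sample_barcode.split("-")[:3])


# Toy stand-in for the expression matrix: rows are sample-level barcodes.
expr = pd.DataFrame(
    {"GENE_A": [0.1, 0.5, 0.9]},
    index=["TCGA-02-0001-01", "TCGA-02-0001-11", "TCGA-04-1331-01"],
)

# Toy stand-in for the clinical table: one row per patient.
# "patient_id" and "disease" are hypothetical column names.
clinical = pd.DataFrame(
    {"patient_id": ["TCGA-02-0001", "TCGA-04-1331"], "disease": ["GBM", "OV"]}
)

# Derive the patient barcode for every sample, then attach patient metadata.
expr_with_meta = (
    expr.assign(patient_id=[patient_barcode(s) for s in expr.index])
    .rename_axis("sample_id")
    .reset_index()
    .merge(clinical, on="patient_id", how="left")
    .set_index("sample_id")
)

# Keep only the samples from one disease.
gbm_only = expr_with_meta[expr_with_meta["disease"] == "GBM"]
```

Several samples can share one patient barcode (for example a tumor and a matched normal), so the merge is many-to-one from samples to patients.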
In either direction, the mapping can be done the same way. I don't think I've used the portion_id column though. Is there a sample_id column or something similar?
This file made it trivial: https://github.com/greenelab/pancancer/blob/master/data/sample_freeze.tsv
Thanks!
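For anyone landing here later, a rough sketch of how a sample-to-disease table like sample_freeze.tsv could be used to subset the expression matrix. The column names SAMPLE_BARCODE and DISEASE are assumptions, as is the helper name subset_by_disease; check the header of the actual file and adjust the arguments. The data below is a toy stand-in, not the real file contents.

```python
import pandas as pd


def subset_by_disease(
    expr: pd.DataFrame,
    freeze: pd.DataFrame,
    disease: str,
    sample_col: str = "SAMPLE_BARCODE",  # assumed column name
    disease_col: str = "DISEASE",        # assumed column name
) -> pd.DataFrame:
    """Keep only the rows of expr whose index appears in freeze for one disease."""
    keep = freeze.loc[freeze[disease_col] == disease, sample_col]
    return expr.loc[expr.index.isin(keep)]


# Toy stand-in for the expression matrix (rows are sample barcodes).
expr = pd.DataFrame(
    {"GENE_A": [0.1, 0.5, 0.9]},
    index=["TCGA-02-0001-01", "TCGA-04-1331-01", "TCGA-06-0125-01"],
)

# Toy stand-in for the sample freeze table.
freeze = pd.DataFrame(
    {
        "SAMPLE_BARCODE": ["TCGA-02-0001-01", "TCGA-04-1331-01", "TCGA-06-0125-01"],
        "DISEASE": ["GBM", "OV", "GBM"],
    }
)

gbm_expr = subset_by_disease(expr, freeze, "GBM")
```

With the real files, the two tables would be loaded with pd.read_csv(path, sep="\t"), using index_col=0 for the expression matrix.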