Note: When using this resource, please credit Renzo DiNatale, M.D. and Christopher Fong, Ph.D. (MSKCC)
Cancer is a disease that can involve metastatic spread from its primary site (e.g., lung) to another organ (e.g., liver). However, the naming conventions used for these organs can vary slightly, or sites may be specified with more granularity than is appropriate for a given analysis. This is most evident in free-text entries, where the variability is large.
This repository contains functions that annotate the MSK-IMPACT cohort summary table with a standardized organ site map between primary and metastatic organ sites. The annotator maps either free-text organ site names OR secondary malignant neoplasm ICD billing codes to a standard organ site ontology.
From cBioPortal, a user can download a study summary of a cohort (example at /demo_data/msk_impact_2017_clinical_data.tsv) and find the Primary Tumor Site and Metastatic Site columns, which specify the location of the primary cancer and, if applicable, of the sequenced metastatic sample.
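As a rough illustration of what the two site columns look like once loaded, the sketch below reads a small tab-separated table with pandas. The column names `Primary Tumor Site` and `Metastatic Site` come from the text above; the sample IDs, the `Sample ID` column, and the site values are invented for illustration and are not taken from the MSK-IMPACT data.

```python
import io

import pandas as pd

# Toy stand-in for a cBioPortal clinical data export (tab-separated).
# Rows and the "Sample ID" column are made up for illustration only.
toy_tsv = (
    "Sample ID\tPrimary Tumor Site\tMetastatic Site\n"
    "S1\tLung\tLiver\n"
    "S2\tColon\t\n"
    "S3\tBreast\tBone\n"
)

df_samples = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# Keep only the columns that describe where the sample came from.
site_cols = ["Primary Tumor Site", "Metastatic Site"]
df_sites = df_samples[site_cols]

# Samples with no metastatic site are read as missing values (NaN).
n_metastatic = int(df_sites["Metastatic Site"].notna().sum())
print(n_metastatic)
```

For a real export, the `io.StringIO(...)` argument would be replaced by the path to the downloaded .tsv file.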
Here, a custom mapping to a standard set of organ sites has been created, containing 82 different locations.
Annotations between primary and metastatic sites include:
- Distant or local extension of cancer spread
- Local or distant lymph node metastasis
- Organ type according to hematogenous spread (Liver, lung, portal, non-portal)
Granularity may vary from project to project. To accommodate this, two additional organ mapping profiles have been created to:
- Reduce the number of metastatic sites, aggregating cases to increase the N in each bin
- Have the metastatic sites reflect how ICD billing codes represent secondary malignant neoplasms

Therefore, three mappings are included in this annotation package:
- A manually curated version, where, for each site in a
- A mapping of metastatic sites to oncotree tissue types
- A mapping of metastatic sites based on ICD billing codes for secondary malignant neoplasms
See the readme in /mappings for comprehensive details on the mappings.
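The idea of reducing the number of metastatic sites to increase the N per bin can be sketched with a plain dictionary lookup in pandas. The site names and the fine-to-coarse grouping below are invented for illustration; they are not the curated mappings shipped in /mappings.

```python
import pandas as pd

# Hypothetical fine-grained metastatic sites, one per sample.
fine_sites = pd.Series(["Femur", "Rib", "Liver", "Spine", "Liver"])

# Hypothetical coarse grouping (NOT the repository's actual mapping).
coarse_map = {
    "Femur": "Bone",
    "Rib": "Bone",
    "Spine": "Bone",
    "Liver": "Liver",
}

# Map each fine-grained site to its coarse bin.
coarse_sites = fine_sites.map(coarse_map)

# Compare bin sizes before and after aggregation.
counts_fine = fine_sites.value_counts()
counts_coarse = coarse_sites.value_counts()
print(counts_coarse)
```

In this toy case the three bone sites, each with a single sample, collapse into one bin of three, which is the effect the coarser profiles are meant to achieve.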
Two input systems can be used with this repository:
- Free text as provided in cBioPortal summary tables
- ICD billing codes related to secondary malignant neoplasms or metastatic cancer
This repository is built for Python 3.6+.
Library dependencies:
- Pandas
An annotated version of the input dataset. Additional columns:
Annotation Column | Description |
---|---|
PRIMARY_SITE_RDN_MAP | Standardized Primary Cancer Site |
PRIMARY_SITE_RDN_MAP_MAIN | Main category name from PRIMARY_SITE_RDN_MAP |
PRIMARY_SITE_RDN_MAP_SECONDARY | Secondary category name from PRIMARY_SITE_RDN_MAP |
METASTATIC_SITE_RDN_MAP | Standardized Metastatic Cancer Site |
METASTATIC_SITE_RDN_MAP_MAIN | Main category name from METASTATIC_SITE_RDN_MAP |
METASTATIC_SITE_RDN_MAP_SECONDARY | Secondary category name from METASTATIC_SITE_RDN_MAP |
LYMPH_SPREAD | Annotation if metastatic site is a lymph node, and if it is a regional or distant spread |
LOCAL_EXTENSION | Annotation if metastatic site is a local extension of the primary site |
hematogenous_grouping | Annotation label for hematogenous metastatic dissemination. Can be either Portal, Non-Portal, Lung, or Liver |
METASTATIC_SITE_ONCOTREE_RDN | Metastatic site annotations from METASTATIC_SITE_RDN_MAP, aggregated to fit oncotree tissue types |
METASTATIC_SITE_BILLING_RDN | Metastatic site annotations from METASTATIC_SITE_RDN_MAP, aggregated to fit ICD Billing codes (Secondary malignant neoplasms) |
- path: Pathname, assuming all files are in a single folder
- fname_all_sites: Curated sites connecting to the cBioPortal study summary (MSK-IMPACT)
- fname_hematogenous: Hematogenous spread mapping based on metastatic site
- fname_localext: Mapping to delineate affected local and regional organs
- fname_lymphatic: Mapping to delineate affected local and distant lymph nodes
- fname_site_map: Conversion map of fname_all_sites to oncotree tissue types
- fname_billing_map: Conversion map of fname_all_sites to sites standardized to ICD billing ontologies
- fname_billing_code_dict: Conversion map of fname_all_sites to ICD billing codes
Python object with mapping dataframes as member variables.
- df_samples: Input dataframe containing free-text primary and/or metastatic site names
- col_primary_site: Column name for primary cancer site
- col_met_site: Column name for metastatic cancer site
- label_dist_ln: True or False, whether distant lymph nodes should be incorporated in the mapping
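To make the role of these inputs more concrete, here is a minimal sketch of how a free-text site column could be joined to a standardized site table with a pandas left merge. The function `annotate_met_site`, the lookup columns `free_text` and `standard_site`, and the toy rows are all hypothetical; this is not the repository's implementation, only an illustration of the general technique. The output column name METASTATIC_SITE_RDN_MAP is borrowed from the annotation table above.

```python
import pandas as pd


def annotate_met_site(df_samples, df_map, col_met_site):
    """Left-join a standardized site onto a free-text metastatic site column.

    Hypothetical helper: df_map is assumed to have the columns
    'free_text' and 'standard_site'.
    """
    df = df_samples.copy()
    # Normalize free text so that ' liver' and 'Liver' share one key.
    df["_key"] = df[col_met_site].str.strip().str.lower()

    lookup = df_map.assign(_key=df_map["free_text"].str.strip().str.lower())
    # Guard against duplicated keys, which would multiply rows in the merge.
    lookup = lookup.drop_duplicates("_key")[["_key", "standard_site"]]
    lookup = lookup.rename(columns={"standard_site": "METASTATIC_SITE_RDN_MAP"})

    # A left merge keeps every input sample; unmatched sites stay missing.
    merged = df.merge(lookup, on="_key", how="left")
    return merged.drop(columns="_key")


# Toy inputs, invented for illustration.
df_samples = pd.DataFrame(
    {"Metastatic Site": ["Liver", " liver", "Unknown thing", None]}
)
df_map = pd.DataFrame({"free_text": ["liver"], "standard_site": ["Liver"]})

df_annotated = annotate_met_site(df_samples, df_map, "Metastatic Site")
print(df_annotated)
```

A left merge is used here so that samples whose site text is missing or unrecognized are kept in the output with an empty annotation rather than being dropped.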
Organ sites using the MSK-IMPACT cohort (Nat. Med. 2017) via cBioPortal
Data from this cohort is located at /demo/msk_impact_2017_clinical_data.tsv
After annotations are added, the dataframe is saved to /demo/impact2017_met_site_annotations_impact.csv