{cbioportalR} allows you to access cBioPortal’s genomic and clinical data directly through R. The package wraps cBioPortal’s existing API endpoints in R so R users can easily leverage cBioPortal’s API. Using these functions, you can access genomic data on mutations, copy number alterations and fusions as well as data on tumor mutational burden (TMB), microsatellite instability status (MSI) and select clinical data points (depending on the study).
This package was created to work with both the public cBioPortal website, as well as MSK’s private institutional cbioportal database. To connect to a private database, you must first get an access token (or whatever credentials your institution requires) and supply the specific API url at the beginning of your session (details below).
For more information on cBioPortal, see the following publications:
For full documentation on the cBioPortal API, please see the following links:
You can install the development version of {cbioportalR} with:
remotes::install_github("karissawhiting/cbioportalR")
If you are using the public domain https://www.cbioportal.org/, you do
not need a token to access data. If you are using a private instance of
cbioportal (like MSK’s institutional database), you will need to acquire
a token and save it to your .Renviron
file (or wherever you store
credentials).
Simply log in to your institution’s cbioportal website, acquire a token
(Usually through the ‘Web API’ tab) and save it in your .Renviron
file. This will save the token as an environmental variable so you do
not have to hard code the secret key in your scripts.
Tip: The following {usethis} function can easily find and open the
.Renviron
for you:
usethis::edit_r_environ()
Paste the token you were given (using the format below) in the .Renviron file and save the file changes. You may want to save and restart your R session to ensure the token is saved and recognized.
CBIOPORTAL_TOKEN = 'YOUR_TOKEN'
You can test that your token was saved using:
library(cbioportalR)
get_cbioportal_token()
To reiterate, if you are planning to retrieve data using public cBioPortal, you do not need a token. If you need to access data on an institutional cBioPortal page you must get a token first.
Note: If you are a MSK researcher working on IMPACT, you should connect to MSK’s cBioPortal instance to get the most up to date IMPACT data, and you must follow MSK-IMPACT publication guidelines when using the data
For every new R session, you need to set your database URL. The
get_cbioportal_db()
function will set an environmental variable for
your session that tells the package which database to point to for all
API calls.
You can set it to point to the public database with this shortcut:
library(cbioportalR)
get_cbioportal_db("public")
or you can set it to a specific institution database with:
get_cbioportal_db("<<your institution's url>>/api")
Once you’ve set your preferred db connection, you can pull data via study ID or sample ID.
To see available studies (this depends on what cBioPortal database you are connected to), you can use:
get_studies() %>% head(n = 10)
#> # A tibble: 10 x 13
#> name shortName description publicStudy pmid citation groups status
#> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <int>
#> 1 Oral Sq… Head & ne… Comprehensive … TRUE 2361… Pickerin… "" 0
#> 2 Hepatoc… HCC (Inse… Whole-exome se… TRUE 2582… Schulze … "PUBL… 0
#> 3 Uveal M… UM (QIMR) Whole-genome o… TRUE 2668… Johansso… "PUBL… 0
#> 4 Neurobl… NBL (AMC) Whole genome s… TRUE 2236… Molenaar… "PUBL… 0
#> 5 Nasopha… NPC (Sing… Whole exome se… TRUE 2495… Lin et a… "PUBL… 0
#> 6 Neurobl… NBL (Colo… Whole-genome s… TRUE 2646… Peifer e… "" 0
#> 7 Myelody… MDS (Toky… Whole exome se… TRUE 2190… Yoshida … "" 0
#> 8 Insulin… Panet (Sh… Whole exome se… TRUE 2432… Cao et a… "" 0
#> 9 Pleural… PLMESO (N… Whole-exome se… TRUE 2548… Guo et a… "" 0
#> 10 Pilocyt… PAST (Nat… Whole-genome s… TRUE 2381… Jones et… "PUBL… 0
#> # … with 5 more variables: importDate <chr>, allSampleCount <int>,
#> # studyId <chr>, cancerTypeId <chr>, referenceGenome <chr>
To pull mutation data for a particular study ID you can use:
# As a result you will get a list of dataframes of 1) mutation + fusion and 2) cna.
df <- get_genetics(study_id = "nbl_amc_2012",
mutations = TRUE,
cna = FALSE,
fusions = TRUE)
mutations <- df$mut
df %>% head()
#> $mut
#> # A tibble: 562 x 31
#> uniqueSampleKey uniquePatientKey molecularProfil… Tumor_Sample_Ba… patientId
#> <chr> <chr> <chr> <chr> <chr>
#> 1 TjU5NVQ6bmJsX2F… TjU5NTpuYmxfYW1… nbl_amc_2012_mu… N595T N595
#> 2 TjYwOFQ6bmJsX2F… TjYwODpuYmxfYW1… nbl_amc_2012_mu… N608T N608
#> 3 TjcxOFQ6bmJsX2F… TjcxODpuYmxfYW1… nbl_amc_2012_mu… N718T N718
#> 4 TjU3MlQ6bmJsX2F… TjU3MjpuYmxfYW1… nbl_amc_2012_mu… N572T N572
#> 5 Tjc0NFQ6bmJsX2F… Tjc0NDpuYmxfYW1… nbl_amc_2012_mu… N744T N744
#> 6 TjU2MVQ6bmJsX2F… TjU2MTpuYmxfYW1… nbl_amc_2012_mu… N561T N561
#> 7 TjU0OFQ6bmJsX2F… TjU0ODpuYmxfYW1… nbl_amc_2012_mu… N548T N548
#> 8 TjU3MlQ6bmJsX2F… TjU3MjpuYmxfYW1… nbl_amc_2012_mu… N572T N572
#> 9 TjU3NVQ6bmJsX2F… TjU3NTpuYmxfYW1… nbl_amc_2012_mu… N575T N575
#> 10 TjUwOFQ6bmJsX2F… TjUwODpuYmxfYW1… nbl_amc_2012_mu… N508T N508
#> # … with 552 more rows, and 26 more variables: entrezGeneId <int>,
#> # studyId <chr>, center <chr>, Mutation_Status <chr>, validationStatus <chr>,
#> # startPosition <int>, endPosition <int>, referenceAllele <chr>,
#> # proteinChange <chr>, Variant_Classification <chr>,
#> # functionalImpactScore <chr>, fisValue <dbl>, linkXvar <chr>, linkPdb <chr>,
#> # linkMsa <chr>, ncbiBuild <chr>, Variant_Type <chr>, keyword <chr>,
#> # chr <chr>, variantAllele <chr>, refseqMrnaId <chr>, proteinPosStart <int>,
#> # proteinPosEnd <int>, HGVSp_Short <chr>, Protein_position <int>,
#> # Hugo_Symbol <chr>
#>
#> $cna
#> NULL