mims-harvard/TDC

Expand cancer cell line and patient related datasets, e.g. DepMap, CCLE, TCGA

abearab opened this issue · 5 comments

Describe the problem
To enable cancer research, I would like to suggest adding functionality for working with cancer cell line information to TDC. Newer DepMap releases include changes that make them incompatible with some current data-collection implementations, e.g. GilbertLabUCSF/CanDI#34, kevinhu/cancer_data#79.

Our previous work, CanDI, is a global cancer data integrator in Python that is used to harmonize and query datasets. Stable data from prior DepMap releases is deposited in Harvard Dataverse. I drafted some scripts to download this older but functional data, and we will update the scripts to make them work with newer DepMap releases.

TCGA data access can be even harder, although I just saw https://cloud.google.com/life-sciences/docs/resources/public-datasets/tcga

Describe the solution you'd like
A new data collection method would be very beneficial. It would be great to gather structured and harmonized data for cancer cell lines using TDC. You already have a tool for GDSC, so a similar approach for CCLE and DepMap would be very useful. gget is also planning something like this, so it could be a joint effort: pachterlab/gget#121 (cc @lauraluebbert).
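To make the idea of "structured and harmonized" data concrete, here is a minimal sketch of a loader that tolerates column renames between releases and joins two tables on a shared cell line identifier. Everything in it is an assumption for illustration: the alias table, the function names `read_harmonized` and `join_on_model_id`, and the toy values are hypothetical, and none of this is TDC, CanDI, or DepMap API.

```python
import csv
import io

# Hypothetical aliases: the same field may be labeled differently across releases.
# These names are illustrative assumptions, not a verified DepMap schema.
COLUMN_ALIASES = {
    "DepMap_ID": "model_id",
    "ModelID": "model_id",
    "stripped_cell_line_name": "cell_line_name",
    "StrippedCellLineName": "cell_line_name",
}


def read_harmonized(text):
    """Parse CSV text and rename known alias columns to canonical names."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}
        for row in reader
    ]


def join_on_model_id(metadata, readouts):
    """Inner-join two lists of row dicts on the canonical 'model_id' key."""
    by_id = {row["model_id"]: row for row in metadata}
    joined = []
    for row in readouts:
        meta = by_id.get(row["model_id"])
        if meta is not None:  # drop readouts without matching metadata
            joined.append({**meta, **row})
    return joined


# Toy inputs with made-up identifiers and values, standing in for two layouts.
old_style_metadata = (
    "DepMap_ID,stripped_cell_line_name\n"
    "TOY-0001,CELLA\n"
    "TOY-0002,CELLB\n"
)
new_style_readout = (
    "ModelID,gene,score\n"
    "TOY-0001,GENE1,-0.5\n"
    "TOY-0003,GENE1,0.1\n"
)

metadata = read_harmonized(old_style_metadata)
readouts = read_harmonized(new_style_readout)
harmonized = join_on_model_id(metadata, readouts)
print(harmonized)
```

Keeping the alias table separate from the parsing code means a new release layout would only need a new alias entry, not a new loader.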

Additional context
See these links for CanDI's source code (https://github.com/GilbertLabUCSF/CanDI), docs, or manuscript.

This is an example of my analysis using TDC and CanDI – notebook | blog post | GilbertLabUCSF/Decitabine-treatment#5


other related issues: #191

Thanks for the issue! This sounds interesting. Would it make sense to add this as an additional dataset for the drug response prediction task? https://tdcommons.ai/multi_pred_tasks/drugres/ Or are you thinking more as an independent data function as in https://tdcommons.ai/fct_overview/?

Hi @kexinhuang12345, I think the DepMap and CCLE datasets are multi-modal readouts from different assays performed on cancer cells, and they are, or can be, used in many different tasks. So maybe this could be a "Data Processing" function under "Data Functions"?

Hi @kexinhuang12345 - quick question. Have you ever thought about including tasks related to connecting cancer cell lines to cancer patients? e.g. https://github.com/broadinstitute/celligner
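One possible way to frame this as a task, offered only as a rough sketch: given a tumor expression profile, retrieve the most similar cell line profile. The code below uses plain Pearson-correlation nearest-neighbour matching on made-up numbers. It is not Celligner's method, and the names `pearson`, `nearest_cell_line`, and all profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def nearest_cell_line(tumor_profile, cell_lines):
    """Return the name of the cell line whose profile correlates best."""
    return max(cell_lines, key=lambda name: pearson(tumor_profile, cell_lines[name]))


# Toy expression profiles over the same four hypothetical genes.
cell_lines = {
    "line_A": [1.0, 2.0, 3.0, 4.0],
    "line_B": [4.0, 3.0, 2.0, 1.0],
}
tumors = {
    "tumor_1": [1.1, 1.9, 3.2, 3.8],
    "tumor_2": [3.9, 3.1, 2.2, 0.9],
}

matches = {name: nearest_cell_line(profile, cell_lines) for name, profile in tumors.items()}
print(matches)
```

Real cell line and tumor data differ systematically (for example in tumor purity and culture effects), so raw correlation like this would likely need an alignment step first; that is the part this sketch leaves out.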

Hi @kexinhuang12345,

Interesting! What is the relevant machine learning task formulation for it?

I think there is a wide range of ML tasks possible with the CCLE and DepMap datasets. Here are some examples:

As for the data function vs. dataset question for this cancer cell line data, I was thinking more about it, and it seems like it may fit better as datasets, since data functions in general need to be applicable to multiple tasks and datasets rather than being dataset-specific.

Agreed.

The functions to generate the datasets are definitely useful. We should store them in the data generation repo and reuse them, or even make them into the data loader for more diverse usage.

I guess I'm not aware of the "data generation repo". Let me know how I can help in this regard.

What are your thoughts on this? Also, you mentioned multiple tasks; can you elaborate more on this?

In general, CCLE stands for Cancer Cell Line "Encyclopedia" so conceptually it is a well-established empirical resource for a diverse set of biological questions. Thus, these datasets are widely used for simple query tasks or more advanced ML tasks in the context of cancer cell biology.

Happy to hop on a call to discuss more. Let me know, thanks!!

I'll send an email right after this, thank you.