PayneLab/cptac

List available versions of datasets


Hey there,

this is a great package!

There is a function cptac.list_datasets. However, it does not display the available versions of each dataset. How can I list all available versions of a dataset so that I can decide which ones to download?

Also, which datasets on the https://cptac-data-portal.georgetown.edu portal do these versions correspond to? There are CPTAC, CPTAC2 and CPTAC3 releases. For example, following your tutorial1, I find that the Endometrial dataset is version 2.1.1. Does that mean it is from the CPTAC2 release?

Looking forward to your reply!
Best wishes,
Paula

Hey Paula,

great questions.

You should always just download the current version (the default).

The data on the CPTAC portal contains both the raw MS data and the protein quantification done by a harmonized pipeline. The data in our tool is the published version of the quantification tables. This differs slightly from the output of the harmonized pipeline, because each publication actively curated both its pipeline and its filtering.

The datasets are from both CPTAC 2 and CPTAC 3. Each published dataset should have a PubMed link to help you identify the right manuscript. As a side note, the version number, e.g. Endometrial 2.2.1, has nothing to do with CPTAC 2 vs. CPTAC 3.

And just to clarify one thing: The reason we provide multiple versions of each dataset is that every so often, there are minor updates to each dataset. The pipeline might have been improved, or another data type might have been added. Each time there's a change to a dataset, we release the update as a new version of that dataset. So the most recent version is always the best one. We just provide the old versions so that if someone had already performed an analysis with an old data version, they can still access that old data after there's an update.
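If you ever end up with a list of version strings and want to pick the most recent one yourself, something like the sketch below might help. It assumes the versions are plain dotted numbers (like 2.1.1). `parse_version` and `latest_version` are hypothetical helpers written for this example and are not part of cptac, and the version list is made up for illustration.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "2.1.1" into (2, 1, 1)."""
    return tuple(int(part) for part in version.split("."))


def latest_version(versions: List[str]) -> str:
    """Return the highest version, comparing the parts as numbers, not as text."""
    if not versions:
        raise ValueError("no versions given")
    return max(versions, key=parse_version)


# Made-up version list for illustration only (not real cptac versions).
example_versions = ["1.0", "2.1.1", "2.9", "2.10"]
print(latest_version(example_versions))  # prints 2.10
```

The parts are compared as integers because a plain string comparison would rank "2.9" above "2.10".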