kids-first/kf-api-dataservice

New Data Types & File Formats

baileyckelly opened this issue · 4 comments

Below are the new data types and the file formats allowed for each, needed to support the various somatic data.

  1. Data Type: Annotated Somatic Mutations / File Formats: maf, vcf, tbi
  2. Data Type: Gene Expression / File Format: tsv
  3. Data Type: Isoform Expression / File Format: tsv
  4. Data Type: Gene Fusions / File Formats: tsv, pdf
  5. Data Type: Somatic Copy Number Variations / File Formats: tsv, png
  6. Data Type: Somatic Structural Variations / File Formats: vcf, tbi
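
For illustration only, the list above could be written as a lookup table with a small check. This is a sketch, not the dataservice's actual validation code: the names `ALLOWED_FORMATS` and `is_allowed` are hypothetical, and it assumes the file format can be read from the final file extension (so a compressed name such as `sample.vcf.gz` would need extra handling).

```python
from pathlib import Path

# Hypothetical mapping of each new data type to its allowed file formats,
# copied from the list in this ticket.
ALLOWED_FORMATS = {
    "Annotated Somatic Mutations": {"maf", "vcf", "tbi"},
    "Gene Expression": {"tsv"},
    "Isoform Expression": {"tsv"},
    "Gene Fusions": {"tsv", "pdf"},
    "Somatic Copy Number Variations": {"tsv", "png"},
    "Somatic Structural Variations": {"vcf", "tbi"},
}


def is_allowed(data_type: str, file_name: str) -> bool:
    """Return True if the file's final extension is allowed for data_type.

    Assumption: the format is the last extension of the file name,
    compared case-insensitively. Unknown data types are rejected.
    """
    ext = Path(file_name).suffix.lstrip(".").lower()
    return ext in ALLOWED_FORMATS.get(data_type, set())
```

A manifest row could then be checked with something like `is_allowed("Gene Fusions", "sample.pdf")` before ingest.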

@allisonheath @yuankunzhu Please confirm ASAP that this covers all the new data we are ingesting.

How does one determine the data type for these files, by the way? I don't know whether these files are products of harmonization. If they are, then maybe the data type can be tagged on the S3 object?

Oh, these data types are associated with somatic mutation analysis. So, when we have to ingest these files, a manifest from Cavatica will be provided and we can use these mappings for that manifest specifically.

In general, they are the output of a specific workflow type, which is covered by a separate ticket for a field we need to add. For the moment, to get CBTTC and PNOC loaded, Yuankun's team has annotated them in the Cavatica data holder project for us.

I did a quick survey: Expression is never plural. So let's update:
Gene Expressions -> Gene Expression
Isoform Expressions -> Isoform Expression

Updated ticket description.