ababaian/serratus

Migrate assembly data to lovelywater

Opened this issue · 23 comments

We need to migrate all the assembly and annotation data generated as part of Serratus to our data lake in a structured way, so as to allow for programmatic access. This is a proposed folder hierarchy to discuss, where $SRA is the accession variable.

Similar to the rest of the archive, I propose 'flat' folders broken up by major category, where every file name carries a $SRA prefix. So no contig/$SRA/$SRA.data.fa or contig/$SRA/data.tsv cases.

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments
├ cov_index.tsv       # Index file of CoV+ libraries
└ assembly_index.tsv  # Index file of assembled SRA libraries

assembly/cov/$SRA.cov.fa : Contigs identified as CoV (i.e. what the 12K paper is based on)

  • Currently in : s3://serratus-public/assemblies/contigs/
  • Do not include 0B or empty files
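A minimal sketch of the "no 0B or empty files" rule, assuming the bucket listing has already been read into (key, size) pairs. The `S3Object` type and the `non_empty` helper are hypothetical names used only for illustration.

```python
from typing import Iterable, List, NamedTuple


class S3Object(NamedTuple):
    """One entry of a bucket listing (hypothetical helper type)."""
    key: str
    size: int  # size in bytes


def non_empty(objects: Iterable[S3Object]) -> List[S3Object]:
    """Keep only the objects that hold at least one byte."""
    return [obj for obj in objects if obj.size > 0]
```

Note that this only catches true 0B objects; a compressed file with no records inside still has a non-zero size and would need a separate check.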

contigs/ : The coronaSPAdes output files such as $SRA.inputdata.txt, $SRA.coronaspades.txt, $SRA.coronaspades.gene_clusters.fa ... $SRA.coronaspades.assembly_graph_with_scaffolds.gfa.gz

  • Currently as s3://serratus-public/assemblies/other/$SRA.coronaspades/$SRA...
  • Remove $SRA.coronaspades/ intermediate folder
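Removing the intermediate folder amounts to a key rewrite. The sketch below assumes old keys look like assemblies/other/$SRA.coronaspades/$SRA... and that the flat target is assembly/contigs/; `flatten_key` and the accession SRR123456 are illustrative, not part of any existing Serratus tooling.

```python
OLD_PREFIX = "assemblies/other/"   # layout in serratus-public (from the notes above)
NEW_PREFIX = "assembly/contigs/"   # proposed flat layout


def flatten_key(old_key: str) -> str:
    """Map assemblies/other/<SRA>.coronaspades/<file> to assembly/contigs/<file>."""
    if not old_key.startswith(OLD_PREFIX):
        raise ValueError(f"unexpected prefix: {old_key}")
    rest = old_key[len(OLD_PREFIX):]
    folder, sep, filename = rest.partition("/")
    # Expect exactly one intermediate folder followed by a plain file name.
    if not sep or not filename or "/" in filename:
        raise ValueError(f"unexpected layout: {old_key}")
    accession = folder.split(".", 1)[0]
    # The flat layout only works if the file name itself carries the accession.
    if not filename.startswith(accession + "."):
        raise ValueError(f"file name lacks the accession prefix: {old_key}")
    return NEW_PREFIX + filename
```

Refusing files without the accession prefix is a deliberate choice here: dropping the folder from such a file would lose the only record of which library it belongs to.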

annotation/

  • Currently as s3://serratus-public/assemblies/annotations/

gz/ : I was originally thinking of also storing the data as a single $SRA.tar.gz file containing cov/ contig/ and annotation/ data but this will duplicate the data and is probably not a good idea. Instead we can provide a short grabSRA.sh $SRA script which will automatically download all the files associated with a particular $SRA to the local system for users.
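As a rough sketch of what such a grab script could do, here is a Python helper that only builds the AWS CLI commands rather than running them. It assumes the final layout s3://lovelywater/assembly/{cov,contigs,annotation}/ and that every file starts with "$SRA."; `grab_commands` is a hypothetical name.

```python
import shlex
from typing import List

BUCKET = "s3://lovelywater/assembly"        # assumed final location
FOLDERS = ("cov", "contigs", "annotation")  # the three flat folders


def grab_commands(accession: str, dest: str = ".") -> List[str]:
    """Build one `aws s3 cp` command per folder for a single accession."""
    if not accession.isalnum():
        raise ValueError(f"suspicious accession: {accession!r}")
    commands = []
    for folder in FOLDERS:
        args = [
            "aws", "s3", "cp",
            f"{BUCKET}/{folder}/", f"{dest}/{folder}/",
            "--recursive",
            "--exclude", "*",                # drop everything ...
            "--include", f"{accession}.*",   # ... except this accession's files
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands
```

For example, `print("\n".join(grab_commands("SRR123456")))` prints three commands that can be pasted into a shell. As far as I know the --exclude/--include filters are applied after listing the whole prefix, so on a folder with hundreds of thousands of objects this may be slow.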

it's all staged in s3://serratus-rayan/lovelywater/assembly, please have a look before transferring to lovelywater.

Name          Size
annotation/   73.8 GB
cov/          169.2 MB
contigs/      4.0 TB

TODO for me next:

  • quenya, dicistro, satellites CS assemblies into contigs/
  • update access data release page

The README.md in the top-level of lovelywater is out-of-sync with the bucket directory structure.

Most recent version is always on the Data Access Page

That page is also inconsistent. In Naming Conventions, it uses as an example, s3://lovelywater/contig/SRA123456.fa. In the Folder Organization section, there is no such folder contig, and there is no such directory in the bucket (as far as I can see).

The assembly data has not been migrated to it yet; once that's done, this issue can be closed.

edit: updated the access page to reflect situation on the ground

Satellite assemblies have been migrated to s3://serratus-rayan/lovelywater/assembly/contigs, i.e. the same location as other CoV assembly data.
For some reason, I can't find satellites' scaffolds.fasta files, only the gene_clusters.fasta are present. I tend to think I might have never copied scaffolds.fasta to S3 (likely due to a past bug that has recently been fixed) and it's likely that we were only interested in gene_clusters.fasta during the satellite analysis.

c'est la vie. Is this the complete collection of assemblies then?

nope, i'm in the process of moving dicistro/quenya assemblies too, will let you know when it's over

done! dicistro, quenya, satellites assemblies are copied.

total number of accessions assembled in s3://serratus-rayan/lovelywater/assembly/contigs: 56,071
total size of s3://serratus-rayan/lovelywater/: 4.9 TB
scaffolds from CoV assemblies (MFC-compressed): 0.9 TB
scaffolds from other assemblies (gzip-compressed): 0.2 TB
assembly graphs (gzip-compressed): 1.6 TB
(These could be deleted, but keeping them would make it possible to quickly regenerate assemblies, e.g. after a coronaSPAdes update, or to recover the missing scaffolds.fasta files)

Darth annotations of checkv-filtered gene_clusters (gzip-compressed): 2.0 TB
Some of those somehow made their way to the contigs/ folder. Among these, some contain a huge BAM file of reads aligned to contigs, hence the space usage. This was needed for quality control. They could be deleted, as for each of those there is another gzip file without the BAM file. Two options:

  1. delete the large BAM-containing Darth archives and move the small ones into the annotation/ folder
  2. keep everything and move all Darth stuff to the annotation/ folder

Any preference?

Also there is the 1k subset of accession assemblies found by the .pro analysis, wanna include it?

yes

1ksubset: migration done

after some Slack discussions:

  • darth data inside contigs/ has been deleted, as it's mainly redundant with the data already in annotation/, except for the huge BAM files.
  • serratax/serraplace stuff inside contigs/ has been moved to annotation/

so I think we're done

hold on, i'll also move checkV analysis from contigs/ to annotation/

done! Here's the final content of

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments

as staged in s3://serratus-rayan/lovelywater/assembly/.

assembly/cov:

These are the 11,120 coronavirus assemblies made with coronaSPAdes, where contigs have been filtered either using CheckV or using coronaSPAdes' bgc-statistics. See Serratus' manuscript for more details.

assembly/contigs:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes.
Depending on the assembler, a subset of these files will be present for each accession.
Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
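For programmatic access, the naming scheme above can be split into its three parts. This is a sketch under the assumption that [assembler] appears in file names as "coronaspades" or "rnaviralspades" (the exact spelling in the bucket should be checked); `ContigFile`, `parse_contig_name` and `holds_scaffolds` are illustrative names.

```python
from typing import NamedTuple

# Assumed spellings of [assembler] inside file names.
ASSEMBLERS = ("coronaspades", "rnaviralspades")


class ContigFile(NamedTuple):
    accession: str
    assembler: str
    suffix: str


def parse_contig_name(filename: str) -> ContigFile:
    """Split SRRXXXXXX.[assembler].<suffix> into its three parts."""
    parts = filename.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"not an accession.assembler.suffix name: {filename}")
    accession, assembler, suffix = parts
    if assembler.lower() not in ASSEMBLERS:
        raise ValueError(f"unknown assembler in: {filename}")
    return ContigFile(accession, assembler.lower(), suffix)


def holds_scaffolds(f: ContigFile) -> bool:
    """True for files with scaffold sequences, including the misnamed
    contigs.fa.mfc, which holds scaffolds.fasta compressed with MFCompress."""
    return f.suffix in ("contigs.fa.mfc", "scaffolds.fasta.gz")
```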

assembly/annotation:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz

serraplace (taxonomic placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

Then, the following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
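Since annotation/ mixes the outputs of several tools, a small classifier can help when indexing it. The sketch below relies only on the dot-separated tokens visible in the file names listed above; `annotation_tool` and `annotates_cov_assembly` are hypothetical helpers.

```python
# Checked in this order; "checkv_filtered" is a different token from "checkv",
# so serraplace/serratax outputs of CheckV-filtered inputs are not mislabeled.
TOOLS = ("darth", "serraplace", "serratax", "checkv")


def annotation_tool(filename: str) -> str:
    """Return the tool that produced an annotation file, judged by its name."""
    tokens = filename.split(".")
    for tool in TOOLS:
        if tool in tokens:
            return tool
    raise ValueError(f"no known tool token in: {filename}")


def annotates_cov_assembly(filename: str) -> bool:
    """True when the input was an assembly from cov/ (SRRXXXXXX.fa.<tool>...)."""
    tokens = filename.split(".")
    return len(tokens) > 2 and tokens[1] == "fa"
```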

I'll begin data migration shortly!

Take a look at s3://lovelywater/assembly/ and let me know if that looks alright.

Also updated the

If that looks good then close this baby!

What's the status on this? Should I be pulling data from s3://serratus-rayan/lovelywater/assembly/cov/ or s3://lovelywater/assembly/cov/?

either is fine, they are identical. Migration is now complete. I think we're good to close this @rchikhi

Same number of files and size as my folder, looks good

Total Objects: 671859
   Total Size: 3.2 TiB
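The comparison above can be scripted. A minimal sketch, assuming both summaries are text in the "Total Objects / Total Size" form shown here (the form printed by `aws s3 ls --recursive --summarize --human-readable`); `parse_summary` and `same_summary` are illustrative names.

```python
import re
from typing import Tuple


def parse_summary(text: str) -> Tuple[int, str]:
    """Extract (object count, human-readable size) from a listing summary."""
    objects = re.search(r"Total Objects:\s*(\d+)", text)
    size = re.search(r"Total Size:\s*(.+)", text)
    if objects is None or size is None:
        raise ValueError("summary lines not found")
    return int(objects.group(1)), size.group(1).strip()


def same_summary(a: str, b: str) -> bool:
    """Compare two listings by object count and reported total size."""
    return parse_summary(a) == parse_summary(b)
```

Matching counts and rounded sizes are a sanity check rather than proof that two locations are identical; comparing per-object keys and sizes would be stricter.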

so, this issue is closed, yet I noticed today that we never deleted anything off the original location s3://serratus-public/assemblies (though the staged location s3://serratus-rayan/lovelywater got correctly cleared). The original location still contains all the migrated data + some other less useful and non-migrated accessions, like those with partially failed assemblies, a few minia assemblies that coronaspades didn't assemble, etc. I see 48268 coronaspades assemblies on lovelywater and 51756 coronaspades folders on serratus-public (possibly empty in some cases).
@ababaian, a few options:

  1. delete from s3://serratus-public/assemblies only the migrated stuff
  2. delete everything from s3://serratus-public/assemblies
  3. keep s3://serratus-public/assemblies for some reason

I'd go for 1)

One consideration is that serratus-public currently has version control, so you have to do a 2-pass deletion (delete the file, then delete its history) to remove data. We do need to do this, but I've been delaying until the paper is "done" so we don't whoopsy and lose some data we need. My take: let's go with (2) once the paper is done. I'll reopen the issue.
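For the second pass on a versioned bucket, every object version and delete marker has to be removed explicitly. The sketch below makes no AWS calls; it assumes `page` is a dict shaped like one response page of boto3's `list_object_versions` (with "Versions" and "DeleteMarkers" lists), and it yields payloads that could be passed as the `Delete` argument of `delete_objects`. The helper name is hypothetical.

```python
from typing import Dict, Iterator, List

# The S3 DeleteObjects operation accepts at most 1000 keys per request.
MAX_KEYS_PER_REQUEST = 1000


def version_delete_batches(page: dict,
                           batch_size: int = MAX_KEYS_PER_REQUEST) -> Iterator[dict]:
    """Yield delete payloads covering every version and delete marker in `page`."""
    entries: List[Dict[str, str]] = [
        {"Key": item["Key"], "VersionId": item["VersionId"]}
        for field in ("Versions", "DeleteMarkers")
        for item in page.get(field, [])
    ]
    for start in range(0, len(entries), batch_size):
        yield {"Objects": entries[start:start + batch_size], "Quiet": True}
```

This is irreversible once applied, so it is worth printing the batches and reviewing them before wiring the output to an actual delete call.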