MG-RAST/Shock

Indexing on uploaded files of certain types

In Shock, when a user uploads a GFF file and only wants a subset of the data in it, there is currently no way to get that subset without downloading the complete file (76 MB) and parsing it locally.
It would be great if the uploaded file were indexed and the index saved as part of the metadata. An uploaded VCF file for a large sequencing project could be in the range of ~100 GB to 600 GB.

Tabix is a tool for indexing files in certain formats, including gff, bed, sam, vcf and psltab, and it lets the user retrieve a subset of the data from an indexed file.
For example, a GFF file has the following format and is used to store information about features on a genome.
(See ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/annotation/Ptrichocarpa_210_gene.gff3.gz for a sample file)

Chr01 phytozome9_0 gene 1660 2502 . - . ID=Potri.001G000100;Name=Potri.001G000100
Chr01 phytozome9_0 mRNA 1660 2502 . - . ID=PAC:27043735;Name=Potri.001G000100.1;pacid=27043735;longest=1;Parent=Potri.001G000100
Chr01 phytozome9_0 CDS 1660 2502 . - 0 ID=PAC:27043735.CDS.1;Parent=PAC:27043735;pacid=27043735
Chr01 phytozome9_0 gene 2906 6646 . - . ID=Potri.001G000200;Name=Potri.001G000200
Chr01 phytozome9_0 mRNA 2906 6646 . - . ID=PAC:27045395;Name=Potri.001G000200.1;pacid=27045395;longest=1;Parent=Potri.001G000200
Chr01 phytozome9_0 CDS 6501 6644 . - 0 ID=PAC:27045395.CDS.1;Parent=PAC:27045395;pacid=27045395
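
To make "a subset of data" concrete, here is a minimal sketch of a region lookup done without any index. It assumes the columns are tab-separated, as the GFF3 format requires, with the sequence name in column 1, the start in column 4 and the end in column 5. The function name gff_region is hypothetical. It has to read every line of the file, which is exactly the cost an index would avoid.

```shell
# gff_region FILE SEQID START END   (hypothetical helper, not part of Shock or tabix)
# Print every feature line in FILE on sequence SEQID that overlaps START..END
# (1-based, inclusive). This is a linear scan over the whole file.
gff_region() {
    awk -F '\t' -v seq="$2" -v start="$3" -v end="$4" '
        /^#/ { next }                              # skip header and comment lines
        $1 == seq && $4 <= end && $5 >= start      # column 4 = start, column 5 = end
    ' "$1"
}
```

With the sample lines above saved tab-separated as in.gff, `gff_region in.gff Chr01 6644 6644` should print the last three lines (the Potri.001G000200 gene, its mRNA and the CDS at 6501-6644).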

With the GFF file on my local system, I would index and query it with tabix as follows, but I am not sure how to do this in Shock.

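# Keep header lines first, sort features by sequence name and numeric start, then compress with bgzip
(grep '^#' in.gff; grep -v '^#' in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz
# Build the tabix index (written next to the data as sorted.gff.gz.tbi)
tabix -p gff sorted.gff.gz
# Query a region; the sequence name must match column 1 exactly (Chr01 in the sample above)
tabix sorted.gff.gz Chr01:6644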
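
If Shock were to run these steps on upload, they could be wrapped roughly as below. This is only a sketch: the function name index_gff is hypothetical, it assumes bgzip and tabix are installed on the server, and how Shock would call it or store the resulting .tbi file with the node is left open.

```shell
# index_gff IN.gff OUT.gff.gz   (hypothetical helper name)
# Sort, compress and index a GFF file; the index is written to OUT.gff.gz.tbi.
index_gff() {
    # Stop early if either tool is missing.
    command -v bgzip >/dev/null 2>&1 || { echo "bgzip not found" >&2; return 1; }
    command -v tabix >/dev/null 2>&1 || { echo "tabix not found" >&2; return 1; }
    # Header lines first (there may be none), then features sorted by
    # sequence name and numeric start position.
    { grep '^#' "$1" || true; grep -v '^#' "$1" | sort -k1,1 -k4,4n; } | bgzip > "$2" || return 1
    # -f overwrites an index left over from an earlier run.
    tabix -f -p gff "$2"
}
```

After `index_gff in.gff sorted.gff.gz`, a query such as `tabix sorted.gff.gz Chr01:6501-6644` only needs the compressed file and its index, not a full parse of the original.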