SciLifeLab/NGI-RNAseq

Test data URL will go down soon

ewels opened this issue · 12 comments

ewels commented

Currently, the test data for this and our other pipelines is hosted on the UPPMAX milou webexport resource. This is going to disappear pretty soon. UPPMAX have no plans to replace this service anywhere else, so we need to come up with an alternative solution.

Using a project on the SNIC Science Cloud could work (we have a project there). We could also set up a mini sandboxed web server on one of our local servers. Or we could try to use some other kind of public data hosting service (such as GitHub if the data isn't too big?).

Phil

Is your test data that big?

It's 625 MB. I just checked.

OK, so yeah, that's quite big...
What are the files you need to make up that much?

Seems like the limit for ordinary repos is 100 MB, but you can apparently use LFS with files up to 2 GB: https://git-lfs.github.com/
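To see which test files would actually hit that per-file limit, something like the following rough sketch could be run over the test data directory. The function name `files_over_limit` is made up for illustration, and treating "100 MB" as 100 * 1024 * 1024 bytes is an assumption, not something confirmed in this thread.

```python
import os

# Per-file limit quoted in this thread; the exact byte count is an assumption.
LIMIT_BYTES = 100 * 1024 * 1024


def files_over_limit(root, limit=LIMIT_BYTES):
    """Return sorted (relative_path, size_in_bytes) pairs for files under root larger than limit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                hits.append((os.path.relpath(path, root), size))
    return sorted(hits)
```

Calling `files_over_limit("path/to/testdata")` returns an empty list if every file fits in an ordinary repo; anything it does return would need LFS or a smaller replacement.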

But then you'll have to install LFS in Travis too ;-)

True. How easy is the SNIC Science Cloud to work with?

For CAW, we have a very small data set for testing, and corresponding small references (that we use to build the indexes, and everything).
In the end, it's not that big.
Is an approach like that possible for the RNAseq pipeline?

ewels commented

I think most of the filesize is in the STAR index. So if we build that as part of the tests, we can probably make it a lot smaller...
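A minimal sketch of what "build that as part of the tests" might look like, assuming STAR is on the PATH and the test data ships a genome FASTA and a GTF annotation. The helper names `star_index_command` and `build_star_index` are hypothetical and not part of the pipeline; the STAR options used are the standard `genomeGenerate` ones.

```python
import os
import subprocess


def star_index_command(fasta, gtf, out_dir, threads=1, sa_index_nbases=None):
    """Build the argument list for a STAR genomeGenerate run (nothing is executed here)."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", out_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--runThreadN", str(threads),
    ]
    # Optional: the STAR manual suggests lowering this for small genomes.
    if sa_index_nbases is not None:
        cmd += ["--genomeSAindexNbases", str(sa_index_nbases)]
    return cmd


def build_star_index(fasta, gtf, out_dir, **kwargs):
    """Create out_dir and run STAR; raises CalledProcessError if STAR fails."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(star_index_command(fasta, gtf, out_dir, **kwargs), check=True)
```

With this, the test setup would only need to download the FASTA and GTF and call something like `build_star_index("genome.fa", "genes.gtf", "star")` before the pipeline run, at the cost of some extra CI time.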

There is also the possibility of making a container with the references.

Yeah, the STAR index is most of the file size actually:

2.9M Nov 22  2016 SRR4238351_subsamp.fastq.gz
4.4M Nov 22  2016 SRR4238355_subsamp.fastq.gz
4.3M Nov 22  2016 SRR4238359_subsamp.fastq.gz
3.4M Nov 22  2016 SRR4238379_subsamp.fastq.gz
391K Nov 22  2016 genes.bed
11M Nov 22  2016 genes.gtf
12M Nov 22  2016 genome.fa
320B Dec  9  2016 r64
576B Nov 22  2016 star 

That's indeed quite a lot...

ewels commented

This is fixed and done.