marbl/CHM13

Extracting a subset of data from raw nanopore signal data

hasindu2008 opened this issue · 5 comments

I was looking for an ONT raw signal dataset at very high coverage (several hundred X), and the nanopore dataset in this repository seems ideal. I only need the raw data for a few genomic regions. Is there a way to selectively download a set of read IDs from the raw dataset, without having to download and extract all the terabytes of tar.gz (which I estimate would take weeks to months)?

Unfortunately, we don't have the data organized by chromosome so your only option would be to download and extract the full set. If you have IDs of the reads you're interested in and post them here, I can try to look up which partitions they are in and you can download just those.
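The lookup offered above could be sketched roughly as follows. This assumes an index mapping read IDs to partition archive names is available; no such index is published in this thread, and the function name, read IDs and archive names below are all hypothetical.

```python
from collections import defaultdict


def partitions_for_reads(read_ids, read_to_partition):
    """Group the wanted read IDs by the partition archive that holds them.

    read_to_partition is an assumed {read_id: archive_name} index.
    Returns (needed, missing): a dict of archive -> read IDs, and the
    list of read IDs that the index does not know about.
    """
    needed = defaultdict(list)
    missing = []
    for rid in read_ids:
        part = read_to_partition.get(rid)
        if part is None:
            missing.append(rid)
        else:
            needed[part].append(rid)
    return dict(needed), missing


# Hypothetical index and query, for illustration only.
index = {
    "read-a": "partition_001.tar.gz",
    "read-b": "partition_002.tar.gz",
    "read-c": "partition_001.tar.gz",
}
needed, missing = partitions_for_reads(["read-a", "read-c", "read-x"], index)
print(sorted(needed))  # archives that would have to be downloaded
print(missing)         # read IDs not found in the index
```

If the wanted reads turn out to be spread over most partitions, the set of needed archives approaches the full dataset and a selective download saves little.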

As the reads seemed to be distributed throughout the partitions (and I would have had to iteratively try different subsets), I ended up downloading the whole thing, and after about 2 weeks it has fully downloaded! I am now extracting everything and hope the file system can handle the large number of files. I will let you know how it goes. This is an exciting dataset.

It'd be really useful to have fast5 files sorted by chromosome/position. That'd be a lot of effort to set up, though.

@gringer Yes, when the data is in FAST5, every manipulation task is hard.
I recently converted all the partitions into BLOW5, and any type of sorting is now a few bash commands. I would be able to provide such sorting if you are interested.
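As an illustration of what sorting by chromosome/position involves, here is a minimal Python sketch rather than the bash commands mentioned above. It assumes the reads have already been aligned and that the alignments are available as PAF lines (tab-separated, with the read name in column 1, the target name in column 6 and the target start in column 8). The function name and the sample records are hypothetical; the resulting ordered ID list would then be handed to whichever tool extracts the reads.

```python
def sort_reads_by_position(paf_lines):
    """Return read IDs ordered by (target name, target start).

    Reads with several alignments appear once per alignment in this
    sketch, and target names sort lexicographically (so "chr10" comes
    before "chr2").
    """
    records = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or truncated lines
        read_id, chrom, start = fields[0], fields[5], int(fields[7])
        records.append((chrom, start, read_id))
    records.sort()
    return [read_id for _, _, read_id in records]


# Hypothetical alignments, for illustration only.
paf = [
    "read-b\t1000\t0\t1000\t+\tchr2\t5000\t200\t1200",
    "read-a\t1000\t0\t1000\t+\tchr1\t5000\t300\t1300",
    "read-c\t1000\t0\t1000\t-\tchr1\t5000\t100\t1100",
]
ordered = sort_reads_by_position(paf)
print(ordered)
```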

@skoren Do you have the total number of reads in the dataset?
After conversion to BLOW5, the total size dropped to 3.4 TB, down from 5.2 TB as compressed FAST5 tar.gz archives. I am asking so that I can double-check that all the reads are present in the converted version.
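The check described above could look something like this sketch, assuming the read IDs of both the original and the converted data can be listed (how they are obtained is left open here). The function names and the toy read IDs are hypothetical; only the 5.2 TB and 3.4 TB figures come from this thread.

```python
def compare_read_sets(original_ids, converted_ids):
    """Report counts plus any reads lost or gained during conversion."""
    original = set(original_ids)
    converted = set(converted_ids)
    return {
        "original": len(original),
        "converted": len(converted),
        "missing": sorted(original - converted),     # lost in conversion
        "unexpected": sorted(converted - original),  # not in the source
    }


def size_reduction(before_tb, after_tb):
    """Fraction of space saved going from before_tb to after_tb."""
    return 1.0 - after_tb / before_tb


# Toy read IDs, for illustration only.
report = compare_read_sets(["r1", "r2", "r3"], ["r1", "r3"])
print(report)

# Sizes quoted in this thread: 5.2 TB before, 3.4 TB after.
saving = size_reduction(5.2, 3.4)
print(f"{saving:.1%}")  # 34.6%
```

Matching totals alone would not catch a read that was dropped while another was duplicated, which is why the sketch compares ID sets rather than just counts.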