amojarro/carrierseq

Channels not found with grep on $all_reads


vsoch commented

hey @amojarro! I'm working on some Singularity images (like Docker, but safe for HPC) to go along with a publication for an internal container organization format, and one of the community recommended your pipeline (do you know Pim?). It's going well so far - I have two versions of the container:

https://github.com/vsoch/carrierseq/tree/singularity

but I am hitting a snag. This call:

grep -Eo '_ch[0-9]+_|ch=[0-9]+' $all_reads > $output_folder/06_poisson_calculation/01_reads_channels.lst

returns nothing. I am using the data that you linked, so I'm thinking either the data changed or the grep call needs to be adjusted. When no channels are found, the downstream Python script then understandably fails, since it is handed a 0 as the denominator.
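In case it helps reproduce, here is the quick diagnostic I'd suggest (not part of the pipeline; $all_reads as in the call above) - peek at the first headers and count how many reads the regex actually matches:

head -n 4 $all_reads
grep -Ec '_ch[0-9]+_|ch=[0-9]+' $all_reads  # reads with channel info; 0 here, matching the empty output above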

Thanks for your help with this!

amojarro commented

Hi @vsoch, it looks like the Sequence Read Archive (SRA) has replaced the original read headers.

Normally, the sequence data would contain either the output information from the Albacore basecaller or from a Poretools fastq conversion command (fast5 > fastq).
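For context, that conversion step is typically something along these lines (the fast5 directory path here is just illustrative):

poretools fastq path/to/fast5/pass/ > reads.fastq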

For example, an Albacore header looks like [read ID, run ID, read number, channel, start_time]:

@cc74d4a9-b62f-4274-86d0-7d95370b6aba runid=55268 read=23015 ch=434 start_time=2017-06-22T17:44:34Z

And a Poretools header looks like [read ID, path/to/fast5]:

@channel_434_cc74d4a9-b62f-4274-86d0-7d95370b6aba_template /Users/mojarro/Documents/Sequencing/Low_Input_Sequencing/minknow_1_5_18/fast5/pass/127/VENUSAUR_20170511_FNFAE22530_MN17220_sequencing_run_sample_id_55268_ch434_read23015_strand.fast5

However, the header information has now been replaced with an SRA ID and only the read ID:

>gnl|SRA|SRR5935058.1 895b5243-42d4-4cc6-8b5b-c29c813bf663_Basecall_1D template (Biological)
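To make the difference concrete, here is the pipeline's regex run over one header of each style (the first two are abbreviated from the examples above) - it extracts the channel from the Albacore and Poretools headers but finds nothing in the SRA one:

printf '%s\n' \
    '@cc74d4a9 runid=55268 read=23015 ch=434' \
    '@channel_434_cc74d4a9_template sample_id_55268_ch434_read23015_strand.fast5' \
    '>gnl|SRA|SRR5935058.1 895b5243 template' \
    | grep -Eo '_ch[0-9]+_|ch=[0-9]+'
# prints "ch=434" and "_ch434_"; the SRA-style header contributes no match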

Thank you for the comment; I will investigate how to preserve the original metadata on NCBI. In the meantime, I have uploaded the original fastq file to Dropbox:

https://www.dropbox.com/sh/vyor82ulzh7n9ke/AAC4W8rMe4z5hdb7j4QhF_IYa?dl=0

vsoch commented

Fantastic! Thanks for your quick response and for looking into this - I'll give it another try with the updated file and will keep a lookout for updates from you here. A similar thing happened to me and a colleague with data URLs, and we ultimately opted to serve the data ourselves.