wheretrue/biobear

merge paired-end sequences

abearab opened this issue · 5 comments

@tshauck – hi there

I just made some progress in this PR: ArcInstitute/ScreenPro2#40. You can see my code in the cas12 module, which could be improved. Currently my code depends on something like this process_fastq.sh, and it would be ideal to do all of that using biobear, i.e. using the features discussed in #103 and #116 (here).

I already uploaded toy data after R1/R2 merge in ScreenPro2; I can also upload the original files if that helps.

Hey, nice progress! And thanks for sharing the files. Let me follow up this weekend after I've had a chance to look at #103 and #105 in the context of the cas12 files.

Also, a mildly unrelated side note: it's funny to see cas12 again. At my previous employer I did metagenomics discovery, and one of the things we looked for was type V systems.

@abearab, I was looking at this a bit, and wanted to ask what you think an ideal interface would be here.

E.g.

SELECT *
FROM merge_paired_end_reads('path/to/read_1.fastq', 'path/to/read_2.fastq', 'ADAPTER1', 'ADAPTER2')

Or, for a single input file:

SELECT *
FROM merge_reads('path/to/read_1.fastq', 'ADAPTER1')

Is this what you had in mind for doing it all in biobear? If you have other thoughts, maybe you could sketch out some pseudocode for your ideal solution. Thanks!

Hi @tshauck – I found this cartoon, which may give you a better sense of how read pairs are merged.

[image: cartoon illustrating the merging of paired-end reads]


I think your merge_paired_end_reads needs to use one of the existing algorithms (e.g. PEAR, FLASH, etc.) to check whether R1 and R2 can be merged. Happy to discuss more, but I think reading the docs for these tools will be more useful for you. I've only used them, so I'm not the right person to explain the details. Looking forward to seeing more from biobear!

This is helpful, thanks! It looks like bbmerge may have a nice paper describing their approach: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657622/
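
For anyone following along, here is a minimal Python sketch of the basic idea behind R1/R2 merging: reverse-complement R2, then look for an overlap between the end of R1 and the start of that reverse complement. This is only an illustration, not how PEAR, FLASH, or BBMerge actually work (those tools use base qualities and more careful scoring), and it is not a biobear API. The names `merge_pair`, `min_overlap`, and `max_mismatch_frac` are made up for this example.

```python
# Naive overlap-based merge of a single read pair (illustration only).
# Assumption: reads are plain strings over A/C/G/T/N; base qualities are ignored.
from typing import Optional

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(
    r1: str,
    r2: str,
    min_overlap: int = 10,            # hypothetical default
    max_mismatch_frac: float = 0.1,   # hypothetical default
) -> Optional[str]:
    """Merge R1 with the reverse complement of R2, or return None.

    Tries the longest candidate overlap first and accepts the first one
    whose mismatch fraction is within the allowed threshold.
    """
    r2_rc = reverse_complement(r2)
    longest = min(len(r1), len(r2_rc))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = r1[-overlap:]      # end of R1
        head = r2_rc[:overlap]    # start of reverse-complemented R2
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            # Keep R1 in the overlap region, then append the rest of R2.
            return r1 + r2_rc[overlap:]
    return None


# Usage: simulate a 32 bp fragment sequenced from both ends with 22 bp reads.
fragment = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
read_1 = fragment[:22]
read_2 = reverse_complement(fragment[10:])
merged = merge_pair(read_1, read_2)
```

In this toy case the two reads overlap by 12 bases, so `merged` reconstructs the original fragment. A real implementation would also need to handle adapters, read-through (fragments shorter than the read length), and quality-aware consensus in the overlap.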