lmdu/pyfastx

Does pyfastx split fasta files sequentially?

Closed this issue · 7 comments

I want to split paired-end FASTQ files. If I run pyfastx split -n 30 on each file, will the pair order be preserved, so that subset 1 of the forward reads and subset 1 of the reverse reads are correctly paired?

I tried running split and got the following error:

python: malloc.c:2401: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

lmdu commented

I have fixed the malloc issue in the new version. pyfastx cannot split FASTA files sequentially, but it does split FASTQ files sequentially. If you use pyfastx to split paired-end FASTQ files, you should ensure that the two paired files have the same read counts and that the reads of each pair are at the same position in both files.
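
To illustrate why those two conditions matter, here is a minimal Python sketch of a sequential split applied to both files of a pair. It is not pyfastx's implementation: the function names (parse_fastq, split_sequential, split_pairs) are hypothetical, the records are assumed to be plain 4-line FASTQ held in memory, and the chunk-size rule (sizes differing by at most one read) is an assumption. The point is only that cutting both files at the same positions keeps mates together when the read counts and read order match.

```python
def parse_fastq(text):
    """Parse 4-line FASTQ records from a string into (name, seq, qual) tuples."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Keep only the read name: drop the leading '@' and any description.
        records.append((header[1:].split()[0], seq, qual))
    return records


def split_sequential(records, n):
    """Split records into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), n)
    chunks, start = [], 0
    for k in range(n):
        size = base + (1 if k < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def split_pairs(forward, reverse, n):
    """Split both mates at identical positions; refuse files of unequal length."""
    if len(forward) != len(reverse):
        raise ValueError("paired files must contain the same number of reads")
    return list(zip(split_sequential(forward, n), split_sequential(reverse, n)))
```

Because the cut points depend only on the number of reads and on n, the same input always yields the same subsets, and chunk i of the forward file lines up with chunk i of the reverse file.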

So pyfastx output is deterministic? That is, if two paired-end files have their reads in paired order and contain the same number of reads, will each of the N files produced by split contain the same subset of reads by position? And will these be deterministic but not sequential subsets?

I just re-read your comment; I hadn't noticed the difference between FASTA and FASTQ. Why build an index if you are just splitting the FASTQ file sequentially?

lmdu commented

Thank you for your suggestion. It really is not necessary to build an index before splitting a FASTQ file. This will be fixed in a later version.

You could use the indexing functionality to allow random sampling of a FASTQ file from the command line. I would find this very useful.

lmdu commented

I will consider your suggestion and implement random sampling of reads from FASTQ files.
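
For paired-end data, random sampling has to pick the same read positions in both files. The Python sketch below shows one way to do that with a shared seed. It is only an illustration of the idea discussed above, not a pyfastx feature or API: sample_indices and sample_pairs are hypothetical names, and the reads are assumed to be already loaded into two lists in paired order.

```python
import random


def sample_indices(total, k, seed=0):
    """Pick k distinct read positions out of total, reproducibly for a given seed."""
    if k > total:
        raise ValueError("cannot sample more reads than the file contains")
    rng = random.Random(seed)
    # Sort so the sampled reads keep their original file order.
    return sorted(rng.sample(range(total), k))


def sample_pairs(forward, reverse, k, seed=0):
    """Sample the same positions from both mate lists so pairs stay together."""
    if len(forward) != len(reverse):
        raise ValueError("paired files must contain the same number of reads")
    idx = sample_indices(len(forward), k, seed)
    return [forward[i] for i in idx], [reverse[i] for i in idx]
```

Selecting reads by position is where random access to individual records is useful, since the chosen reads can be fetched directly instead of scanning the whole file.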