bioinfologics/sdg

SDG.LinkedReadsDatastore_build_from_fastq never ends

Closed this issue · 2 comments

SDG.LinkedReadsDatastore_build_from_fastq runs forever, creating chunk files indefinitely:
140 MB of linked reads produces 90+ GB of chunks.

In Python:

import pysdg as SDG
ws = SDG.WorkSpace()
ws.sdg.load_from_gfa('./initial_graph.gfa')
SDG.LinkedReadsDatastore_build_from_fastq(
    "./lirds.lirds",
    "li_reads",
    "./child/child-link-reads_R1.fastq",
    "./child/child-link-reads_R2.fastq",
    SDG.LinkedReadsFormat_raw,
    readsize=250,
    chunksize=10000000)

In bash:

ls sorted_chunk_*
-rw-r--r--   1 ggarcia  NR4\Domain Users   4.7G 17 Jul 14:32 sorted_chunk_0.data
-rw-r--r--   1 ggarcia  NR4\Domain Users   4.7G 17 Jul 14:33 sorted_chunk_1.data
-rw-r--r--   1 ggarcia  NR4\Domain Users   4.7G 17 Jul 14:33 sorted_chunk_2.data
-rw-r--r--   1 ggarcia  NR4\Domain Users   4.7G 17 Jul 14:34 sorted_chunk_3.data
-rw-r--r--   1 ggarcia  NR4\Domain Users   4.7G 17 Jul 14:34 sorted_chunk_4.data
-rw-r--r--   1 ggarcia  NR4\Domain Users   4.7G 17 Jul 14:34 sorted_chunk_5.data
-rw-r--r--   1 ggarcia  NR4\Domain Users   1.9G 17 Jul 14:35 sorted_chunk_6.data
...

This continues until the disk is full.
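While the build runs, one quick way to watch how much space the chunk files are consuming (assuming they are written to the current working directory, as in the listing above) is:

```python
from pathlib import Path

# Sum the on-disk size of the sorted chunk files produced so far.
total_bytes = sum(f.stat().st_size for f in Path(".").glob("sorted_chunk_*.data"))
print(f"{total_bytes / 2**30:.2f} GiB in sorted_chunk_*.data files")
```

Re-running this periodically makes the runaway growth obvious: with the raw format the total keeps climbing well past the size of the input FASTQ files.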

Changing the format to SDG.LinkedReadsFormat_seq stops the never-ending chunking, but produces a corrupted datastore, as in #105.

Fixed alongside #105.