`deeptools` `countReadsPerBin` output is not sorted

Question

Closed this issue 7 months ago · 2 comments

The new preprocessing pipeline uses the `CountReadsPerBin` class from `deeptools.countReadsPerBin`. It runs with multiprocessing and is much faster than the previous implementation.

However, its output is not sorted. As a result, two runs of `crpb.run()` can give different results, which makes the rest of the DecoDen pipeline wrong.

Answer 1 · 2024-03-05T17:20:12.000Z

Potential solution:

Re-implement `countReadsPerBin.py` and pass `includeLabels=False` to `mapReduce`. This should return the chromosome, start, and end of each region, which would help in sorting the multiprocessing output.

Answer 2 · 2024-03-06T13:58:23.000Z

The previous solution does not work. This is the new strategy:

1. Read in the `chrom_sizes.bed` file to get the chromosome names and lengths.
2. Call `count_reads_in_region` instead of `run`. This still runs with multiprocessing, but the results are ordered.
3. Concatenate the resulting coverage arrays.
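
The strategy above can be sketched roughly as follows. This is only an illustration of the ordering idea, not the DecoDen implementation. `fake_count` is a hypothetical stand-in for the per-chromosome `count_reads_in_region` call, the chromosome sizes file is assumed to have two whitespace-separated columns (name, length), and plain lists stand in for the coverage arrays. A thread pool is used to keep the sketch self-contained; `ProcessPoolExecutor.map` gives the same ordering guarantee.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def read_chrom_sizes(lines):
    """Parse 'name length' lines into a list of (name, length), keeping file order.

    Assumption: two whitespace-separated columns per line.
    """
    sizes = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or line.startswith("#"):
            continue  # skip blank lines and comments
        sizes.append((fields[0], int(fields[1])))
    return sizes


def fake_count(region):
    """Hypothetical stand-in for counting reads in one chromosome.

    Returns one row per bin, labelled with its position so the order is visible.
    """
    chrom, length, bin_size = region
    time.sleep(random.uniform(0, 0.01))  # workers finish in an unpredictable order
    n_bins = -(-length // bin_size)  # ceiling division
    return [f"{chrom}:{i * bin_size}" for i in range(n_bins)]


def ordered_coverage(chrom_sizes, bin_size, count_fn=fake_count, workers=4,
                     executor_cls=ThreadPoolExecutor):
    """Count each chromosome in parallel and concatenate in input order."""
    regions = [(chrom, length, bin_size) for chrom, length in chrom_sizes]
    with executor_cls(max_workers=workers) as pool:
        # Executor.map yields results in the order of `regions`,
        # regardless of which worker finishes first.
        per_chrom = list(pool.map(count_fn, regions))
    coverage = []
    for rows in per_chrom:
        coverage.extend(rows)
    return coverage
```

Because the results are collected in the order of the chromosome list rather than in completion order, repeated calls with the same inputs return the bins in the same order.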