ntanmayee/decoden

`deeptools` `countReadsPerBin` output is not sorted

Closed this issue · 2 comments

The new preprocessing pipeline uses the deeptools `countReadsPerBin` class. It uses multiprocessing and is much faster than before.

However, its output is not sorted. This means that two runs of `crpb.run()` can give different results, which makes the rest of the DecoDen pipeline wrong.

Potential solution:

Re-implement `countReadsPerBin.py` and pass `includeLabels=False` to `mapReduce`. This should return the chromosome, start and end for each result, which will help in sorting the multiprocessing output.

The previous solution does not work. The new strategy is:

  1. Read in the `chrom_sizes.bed` file to get chromosome names and lengths.
  2. Call `count_reads_in_region` instead of `run`. This still runs with multiprocessing, but the results are ordered.
  3. Concatenate the resulting coverage arrays.
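
The three steps above can be sketched roughly as follows. This is not the DecoDen code: `parse_chrom_sizes`, `count_region` and `ordered_coverage` are hypothetical names, `count_region` is a stand-in for the per-region deeptools call so that the example runs without BAM files, and the chromosome sizes file is assumed to have two whitespace-separated columns (name, length). A thread pool is used only to keep the sketch self-contained. The point it illustrates is that an order-preserving `map` returns per-chromosome results in input order, so they can be concatenated directly.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def parse_chrom_sizes(lines):
    """Parse 'name length' lines into a list of (chrom, length) pairs.

    Assumption: two whitespace-separated columns per line.
    """
    sizes = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sizes.append((fields[0], int(fields[1])))
    return sizes


def count_region(region):
    """Hypothetical stand-in for counting reads in one region.

    Returns one (chrom, bin_start) entry per bin instead of real counts.
    """
    chrom, start, end, bin_size = region
    return [(chrom, pos) for pos in range(start, end, bin_size)]


def ordered_coverage(chrom_sizes, bin_size, workers=2):
    """Process each chromosome in parallel, then concatenate in input order."""
    regions = [(chrom, 0, length, bin_size) for chrom, length in chrom_sizes]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of `regions`,
        # regardless of which task finishes first.
        per_chrom = list(pool.map(count_region, regions))
    return list(chain.from_iterable(per_chrom))
```

Because the chromosome order comes from the sizes file and not from task completion order, repeated runs produce the same concatenated output.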