yupenghe/methylpy

multi-threading indexing?

Closed this issue · 4 comments

yvnkm commented

I'm running methylpy single-end-pipeline currently and it has been working very well!

Except that indexing is taking a very long time, and it only uses one core even though I set --num-procs 32. Is there a way I can do multithreading for indexing?

Command line used

methylpy single-end-pipeline --read-files $i --sample ${i%%.fq.gz}.non-pbat --forward-ref hg38_methylpy_f --reverse-ref hg38_methylpy_r --ref-fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa --num-procs 32 --trim-reads False --remove-chr-prefix False &>${i%%.fq.gz}.non-pbat.out

Output file with time stamps

Begin splitting reads for 21_R2.non-pbat_libA
Fri Sep 25 08:49:01 2020

No trimming on reads
Fri Sep 25 08:54:00 2020

Begin converting reads for 21_R2.non-pbat_libA
Fri Sep 25 08:54:00 2020

Begin Running Bowtie2 for libA
Fri Sep 25 08:54:18 2020

32115469 reads; of these:
32115469 (100.00%) were unpaired; of these:
14266954 (44.42%) aligned 0 times
11212951 (34.91%) aligned exactly 1 time
6635564 (20.66%) aligned >1 times
55.58% overall alignment rate
Processing forward strand hits
Fri Sep 25 09:31:11 2020

32115469 reads; of these:
32115469 (100.00%) were unpaired; of these:
14350834 (44.69%) aligned 0 times
11210926 (34.91%) aligned exactly 1 time
6553709 (20.41%) aligned >1 times
55.31% overall alignment rate
Processing reverse strand hits
Fri Sep 25 10:12:29 2020

Finding multimappers
Fri Sep 25 10:14:23 2020

[bam_sort_core] merging from 0 files and 32 in-memory blocks...
There are 32115469 total input reads
Fri Sep 25 10:23:09 2020

There are 20282539 uniquely mapping reads, 63.1550453148 percent remaining
Fri Sep 25 10:23:09 2020

Begin calling mCs
Fri Sep 25 10:23:09 2020

Input not indexed. Indexing...
Fri Sep 25 10:23:09 2020

[mpileup] 1 samples in 1 input files
Done
Fri Sep 25 11:50:51 2020

Hey, yes, the latest version of samtools supports multi-threaded indexing. I just added this feature to methylpy, so indexing should be much faster in the latest version of methylpy.
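For reference, multi-threaded indexing with samtools itself looks roughly like the sketch below. It assumes a samtools release whose `index` subcommand accepts `-@` (thread count). The helper name `index_bam`, the BAM file name, and the thread count are hypothetical and only for illustration; this is not how methylpy invokes samtools internally.

```shell
# Sketch: index a BAM file with several threads.
# index_bam is a hypothetical helper, not part of methylpy or samtools.
index_bam() {
    bam="$1"             # path to a coordinate-sorted BAM file
    threads="${2:-4}"    # thread count handed to samtools (default 4)
    if [ ! -f "$bam" ]; then
        echo "BAM file not found: $bam" >&2
        return 1
    fi
    # -@ sets the thread count; older samtools releases may reject it
    samtools index -@ "$threads" "$bam"
}

# Example (placeholder file name):
# index_bam sample_processed_reads.bam 32
```

A pre-built index like this can only help when the BAM file already exists before methylation calling, for example when running `call-methylation-state` on its own.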

yvnkm commented

Thanks for the reply! What would be the latest version # of methylpy? Mine is 1.4.3 (installed from Anaconda).

output from methylpy

$ methylpy
usage: methylpy [-h] ...

You are using methylpy 1.4.3 version
(/python3.7/site-packages/methylpy/)

optional arguments:
-h, --help show this help message and exit

functions:

build-reference     Building reference for bisulfite sequencing data
single-end-pipeline
                    Methylation pipeline for single-end data
paired-end-pipeline
                    Methylation pipeline for paired-end data
DMRfind             Identify differentially methylated regions
reidentify-DMR      Re-call DMRs from existing DMRfind result
add-methylation-level
                    Get methylation level of genomic regions
bam-quality-filter  Filter out single-end reads by mapping quality and mCH
                    level
call-methylation-state
                    Call cytosine methylation state from BAM file
allc-to-bigwig      Get bigwig file from allc file
merge-allc          Merge allc files
index-allc          Index allc files
filter-allc         Filter allc file
test-allc           Binomial test on allc file

The latest is 1.4.6. Methylpy can be updated through conda and pip.

You can upgrade the package using pip. The conda package usually lags behind (it still provides the version released 4 months ago).

pip install --upgrade methylpy
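After upgrading, it can be worth confirming which version is actually installed. The sketch below compares a version string against 1.4.6, the release named in this thread. `version_ge` is a hypothetical helper that relies on GNU `sort -V`, and whether 1.4.6 is the right threshold for the indexing change is an assumption.

```shell
# Sketch: check whether an installed version is at least a required one.
# version_ge is a hypothetical helper; it relies on GNU `sort -V`.
version_ge() {
    # Succeeds when $1 >= $2 in version order.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare the version reported by pip with 1.4.6
# installed=$(pip show methylpy | awk '/^Version:/ {print $2}')
# version_ge "$installed" 1.4.6 && echo "up to date" || echo "upgrade needed"
```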