nf-core/chipseq

Test for chromap failing

JoseEspinosa opened this issue · 6 comments

The test for chromap are failing with the exception below:

Caused by:
  Process `NFCORE_CHIPSEQ:CHIPSEQ:MACS2_CALLPEAK (SPT5_T0_REP1)` terminated with an error exit status (1)

Command executed:

  macs2 \
      callpeak \
      --keep-dup all --broad --broad-cutoff 0.1 \
      --gsize 11624332 \
      --format BAMPE \
      --name SPT5_T0_REP1 \
      --treatment SPT5_T0_REP1.mLb.clN.sorted.bam \
      --control SPT5_INPUT_REP1.mLb.clN.sorted.bam
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_CHIPSEQ:CHIPSEQ:MACS2_CALLPEAK":
      macs2: $(macs2 --version | sed -e "s/macs2 //g")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  4ca545ee6d5d: Already exists
  ce477abda4ec: Pulling fs layer
  ce477abda4ec: Verifying Checksum
  ce477abda4ec: Download complete
  ce477abda4ec: Pull complete
  Digest: sha256:b8506329f67a88c4c9c0e433028c25ba5c29b199a8936f629a02c40a4ecf1942
  Status: Downloaded newer image for quay.io/biocontainers/macs2:2.2.7.1--py38h4a8c8d9_3
  INFO  @ Wed, 27 Jul 2022 13:20:04: 
  # Command line: callpeak --keep-dup all --broad --broad-cutoff 0.1 --gsize 11624332 --format BAMPE --name SPT5_T0_REP1 --treatment SPT5_T0_REP1.mLb.clN.sorted.bam --control SPT5_INPUT_REP1.mLb.clN.sorted.bam
  # ARGUMENTS LIST:
  # name = SPT5_T0_REP1
  # format = BAMPE
  # ChIP-seq file = ['SPT5_T0_REP1.mLb.clN.sorted.bam']
  # control file = ['SPT5_INPUT_REP1.mLb.clN.sorted.bam']
  # effective genome size = 1.16e+07
  # band width = 300
  # model fold = [5, 50]
  # qvalue cutoff for narrow/strong regions = 5.00e-02
  # qvalue cutoff for broad/weak regions = 1.00e-01
  # The maximum gap between significant sites is assigned as the read length/tag size.
  # The minimum length of peaks is assigned as the predicted fragment length "d".
  # Larger dataset will be scaled towards smaller dataset.
  # Range for calculating regional lambda is: 1000 bps and 10000 bps
  # Broad region calling is on
  # Paired-End mode is on
   
  INFO  @ Wed, 27 Jul 2022 13:20:04: #1 read fragment files... 
  INFO  @ Wed, 27 Jul 2022 13:20:04: #1 read treatment fragments... 
  INFO  @ Wed, 27 Jul 2022 13:20:04: 42941 fragments have been read. 
  INFO  @ Wed, 27 Jul 2022 13:20:04: #1.2 read input fragments... 
  INFO  @ Wed, 27 Jul 2022 13:20:05: 80938 fragments have been read. 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #1 mean fragment size is determined as 0.0 bp from treatment 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #1 note: mean fragment size in control is 0.0 bp -- value ignored 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #1 fragment size = 0.0 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #1  total fragments in treatment: 42941 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #1  total fragments in control: 80938 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #1 finished! 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #2 Build Peak Model... 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #2 Skipped... 
  INFO  @ Wed, 27 Jul 2022 13:20:05: #3 Call peaks... 
  Traceback (most recent call last):
    File "/usr/local/bin/macs2", line 653, in <module>
      main()
    File "/usr/local/bin/macs2", line 51, in main
      run( args )
    File "/usr/local/lib/python3.8/site-packages/MACS2/callpeak_cmd.py", line 256, in run
      peakdetect.call_peaks()
    File "MACS2/PeakDetect.pyx", line 107, in MACS2.PeakDetect.PeakDetect.call_peaks
    File "MACS2/PeakDetect.pyx", line 155, in MACS2.PeakDetect.PeakDetect.__call_peaks_w_control
  ZeroDivisionError: float division

Work dir:
  /home/runner/work/chipseq/chipseq/work/e2/86a656a96d88be7c10e969517fe145

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

To me, it seems like the fragment size can not be correctly obtained from the bam files. However, the test with the small dataset work with the rest of the aligners (BWA, STAR and Bowtie2). The problem seems to be related to the pairing of the reads performed by chromap since I have run the full test with chromap (there we have single-end reads) without encountering this problem.

The command that the pipeline uses for aligning reads with chromap is:

chromap -l 2000 --SAM -t 2 -x genome.index -r genome.fa -1 SPT5_INPUT_REP2_T1_1_val_1.fq.gz -2 SPT5_INPUT_REP2_T1_2_val_2.fq.gz -o SPT5_INPUT_REP2_T1.Lb.sam

After discussion in Slack with @lextallan seems here seems that the problem is the TLEN column in the files produced by chromap is empty: "Main issue for running chromap aligned reads through MACS2 appeared to be the TLEN column in the sam/bam files is blank (or all 0s specifically). Could be wrong here, though. I asked the author of chromap about it (here: haowenz/chromap#99). Long story short, chromap isn't designed for outputting bams but they're working on it for a future release".
To fix the issue momentarily, after discussion with @drpatelh, we will filter the channels after running Chromap only to include single-end data throwing a warning for paired-end data. Once the bug becomes fixed this filter will be removed.

This has been partially fixed in #290 by filtering out paired-end data when Chromap is used. However, once a new version of Chromap is released that output usable bam files, this issue might need to be revisited.

Some update: TLEN is recently added into Chromap's SAM output in the latest release.

Thanks for the update @swiftgenomics will give it a try!

Closes in #338