macs3-project/MACS

ATAC-seq settings

igordot opened this issue ยท 29 comments

I am trying to call peaks in ATAC-seq data. According to the documentation:

To find enriched cutting sites such as some DNAse-Seq datasets. In this case, all 5' ends of sequenced reads should be extended in both direction to smooth the pileup signals. If the wanted smoothing window is 200bps, then use '--nomodel --shift -100 --extsize 200'.
For certain nucleosome-seq data, we need to pileup the centers of nucleosomes using a half-nucleosome size for wavelet analysis (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about 147bps, this option can be used: '--nomodel --shift 37 --extsize 73'.

Based on a brief literature review, people use both --shift -100 --extsize 200 and --shift 37 --extsize 73 for ATAC-seq. Is one option more appropriate? Is there another setting I should be using?

If you followed original protocol for ATAC-Seq, you should get Paired-End reads. If so, I would suggest you just use "--format BAMPE" to let MACS2 pileup the whole fragments in general. But if you want to focus on looking for where the 'cutting sites' are, then '--nomodel --shift -100 --extsize 200' should work.

I have a question towards the BAM and BAMPE mode:

The TLEN of a pair is of course determined by the insert size after the alignment. I understand that in BAMPE, MACS takes only the 5' of the first mate of the pair and uses TLEN for the fragment length. But how is it in BAM mode? Does it treat the two mates idependently, so that literally the coverage of each nt within the pair would be double as high as in BAMPE mode?

@ATpoint, according to the manual:

Pair-end mapping results can be saved in a single BAM file, if so, MACS will automatically keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup.

It sounds like only one of the mates is used.

Yes, but I was referring to what happens with paired-end BAMs when not BAMPE but BAM is specified as -f, as this currently seems to be the method of choice for ATAC-seq in the literature.

Based on that description, it sounds like only one mate (left mate) is used when the reads are paired and BAM (not BAMPE) is specified. Essentially, half the reads are ignored.

Hopefully @taoliu can confirm.

Yes you are right. The more I think about it, it probably autodetects the bitwise flag for paired-end reads and discards the reverse mate for further analysis.

so -f BAMPE is preferred for paired-end ATAC-seq?

@crazyhottommy @igordot Yes. With -f BAMPE on, MACS2 read the left mate and the insertion length information from BAM file, and discard right mate. With -f BAM, MACS2 only keeps the left mate, so it's definitely not a feature you want for your paired-end data.

Why do so many papers use --shift -100 --extsize 200 for MACS2 rather than -f BAMPE if this is not recommended for paired-end data?

Why do so many papers use --shift -100 --extsize 200 for MACS2 rather than -f BAMPE if this is not recommended for paired-end data?

--shift -100 --extsize 200 will amplify the 'cutting sites' enrichment from ATAC-seq data. So in the end, the 'peak' is where Tn5 transposase likes to attack. The fact is that, although many information such as the insertion length and the other mate alignment is ignored, such result is still usable. Especially when the short fragment population is extremely dominant, the final output won't be off much.

Ok thank you. But you would still recommend the -f BAMPE option? After peak calling with -f BAMPE, I then take the 5' end of the forward and reverse reads as the cut sites. Does this make sense?

@taoliu > @crazyhottommy @igordot Yes. With -f BAMPE on, MACS2 read the left mate and the insertion length information from BAM file, and discard right mate. With -f BAM, MACS2 only keeps the left mate, so it's definitely not a feature you want for your paired-end data.

How does --keep-dup 1 works when setting -f BAMPE ? will --keep-dup 1 still deduplicate the fragments ?

Sorry for reopening this discussion, but I have another question regarding ATAC-seq.

I am interested in the cutting sites, so I want to use the --shift and --extsize options. However, as discussed before, we then unfortunately ignore all the mates. I have heard of people just merging the fastqs and treating the paired-end data as single-end to avoid this, however we then have suboptimal alignment.

@taoliu, I am wondering if the assumptions of MACS2 aren't broken if we align PE, and set all RNEXT to *, PNEXT to 0, and shift the mates position by - (minus) TLEN, and use this BAM file for peak calling (with --shift and --extsize).

For recent alternatives have a look at https://github.com/LiuLabUB/HMMRATAC from the Liu lab for ATAC-seq or https://github.com/jsh58/Genrich

If you want to use macs and keep both mates you could write the BAM file as BED with one interval for each of the mates, reduced to the 5' end to get the cutting site and then use -f BED to feed that into macs.

@ATpoint, thanks! Converting from BAM to BED is probably a much simpler approach!

xpan1 commented

@taoliu > @crazyhottommy @igordot Yes. With -f BAMPE on, MACS2 read the left mate and the insertion length information from BAM file, and discard right mate. With -f BAM, MACS2 only keeps the left mate, so it's definitely not a feature you want for your paired-end data.

How does --keep-dup 1 works when setting -f BAMPE ? will --keep-dup 1 still deduplicate the fragments ?

Hi, did you solve your problems about parameter --keep-dup? When I set it as all or 1, the peak numbers are significantly different? Which one is better for ATAC-Seq?

You can use HMMRATAC instead now, which is a substitution of MACS2 spcific designed to ATACseq. It is developed by Tao Liu too.

xpan1 commented

You can use HMMRATAC instead now, which is a substitution of MACS2 spcific designed to ATACseq. It is developed by Tao Liu too.

Thanks, Sounds good! Btw, what bam files do you use? Can you use the bam files after removing the duplicates using Picard?

You can use HMMRATAC instead now, which is a substitution of MACS2 spcific designed to ATACseq. It is developed by Tao Liu too.

Thanks, Sounds good! Btw, what bam files do you use? Can you use the bam files after removing the duplicates using Picard?

Yes, HMMRATAC can run with BAM files after duplicate removal. However, by default, HMMRATAC will remove the duplicates for you using the same Picard process. It is also possible to turn off this behavior with the --rmDup option.

@igordot Hi, I am using MACS2 to call peaks, I am interested in the parameter you used in your article. my data are paired end reads.

Hi, I have quite a naive question regarding the ATAC-seq analysis.

Why do we want to set the --keep-dup all in ATAC-seq. I have seen this setting being used by a lot of people.

My guess:
This setting is turned one because the duplicated alignments have been filtered out before the MACS2 step (i.e. by picard). But then is this case wouldn't --keep-dup all and --keep-dup 1 produce identical results? (This is been somewhat touched in this biostar question)

@yichao-cai --keep-dup all is recommended if the duplicated alignments have been removed in preprocessing steps with other tools. The idea is that we shall not perform the same task with two different tools in our pipeline. For example, if you rely on picard to remove duplicates, then you just leave the task to picard and let macs not to remove duplicates at all. In this way, you will have better control of your pipeline and avoid unexpected errors.

For anyone still following this discussion (more than 4 years later), there is a related thread on Twitter that brings up some interesting points and suggests using "-f BED --shift -100 --extsize 200": https://twitter.com/XiChenUoM/status/1336658454866325506

Thanks, that's very helpful

@igordot I just enabled the 'Discussions' function for MACS on Github: https://github.com/macs3-project/MACS/discussions Hope that the informative discussions in the 'Issues' won't go away because of age :) I can summarize our discussion in this issue and start a new discussion. But if you want to start it, let me know and feel free to do so :) Just don't want to take away your credit.

Another related issue, as I mentioned in Xi Chen's Twitter thread, is #421. As for MACS, the goal is to enable possible ways to analyze the data, so I will add an option (to keep both ends of BAMPE) to make life easier for people who want to focus on the cutting/insertion sites enrichment in ATAC-seq.

@taoliu there is now a discussion: #435

I haven't yet used that GitHub feature, but hopefully it should be more appropriate than an "issue".

Thank you! @igordot

As for MACS, the goal is to enable possible ways to analyze the data, so I will add an option (to keep both ends of BAMPE) to make life easier for people who want to focus on the cutting/insertion sites enrichment in ATAC-seq.

Any update on this @taoliu ? I was just looking through the code, but didn't find the right place where the filtering of 2nd mates occurs.
Also, should this be a change to BAMPE or to plain BAM? If focusing on the cut-sites, the full length of the pairs / fragments does not actually matter, right?