HuffordLab/NAM-genomes

Filter ACRs with Tn5 integration site densities?

yjiang296 opened this issue · 3 comments

Hi! I am trying to use your ATAC-seqs produced in your article "De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes". I noticed that you uploaded the narrowpeak file to CyVerse, and to my surprise, after your refinement, B73 and OH43's ACRS number can be controlled to about 50 thousand.
I tried the same parameters and softwares you mentioned in supplement material, and I found that B73 and OH43's initial ACRs called by MACS2 turn out to be about one hundred thousand. And I guess that you must filtered about 50 thousand ACRs of B73 and OH43 with Tn5 integration site densities. I didn't find any related codes showing how to do this but only a few descriptive sentence "Tn5 integration site density (insertions per kb) was estimated for each ACR, as well as matched randomized control regions (established using bedtools v2.27.1 shuffle specifically excluding ACRs from the randomized selection space). ACRs with Tn5 integration site densities less than the 99.9% upper quantile of randomized control regions were removed from the analysis. " Would you please show me the related codes or explain the general thought to me?
Really looking forward to hearing from you! Best wishes.(This is my first time asking an issue in English and I don't write English quite often, if I used some wrong words or may offended you , please forgive me.)

Hi @yjiang296:
Thanks for your question! I'm guessing that this section was handled by Alex Marand (@plantformatics), so I will tag him to see if he can answer your question.
Sorry about the lacking scripts - it was a daunting task to put together everything in a single place, and we obviously are missing some key scripts here and there. Also sorry that I can't answer your question right away. I'll inquire around and update you once I have the answer.
PS: you don't have to apologize; your English is fantastic! 💯

Thanks,

@aseetharam I actually can't find any of the scripts from the updated analysis (we probably forgot to merge branches?), but since filtering ACRs using empirical FDR is quite simple, ill just post the code here. All you need to run this analysis is your true set of ACRs derived from MACS2 and the same number of random regions as controls. I use bedtools shuffle with the MACS2 output to create the control file. In both inputs (ACRs and control regions) the last column contains the number of Tn5 insertions. After estimating Tn5 density per kb, we simply set a threshold that excludes X% of control regions. So if your FDR is 0.01, this code will find the Tn5 density per kb that removes 99.9% of control regions. We then use that same threshold to filter the ACR input. Hope this helps!

Example command line usage

>Rscript filter_eFDR.R <input_ACRs.bed> <controls.bed> <FDR> <output_filename.bed>

Copy and paste the text below into a file name filter_eFDR.R

# load data
args <- commandArgs(T)
true.acrs <- as.character(args[1])
fake.acrs <- as.character(args[2])
fdr <- as.numeric(args[3])
output <- as.character(args[4])

# import
a <- read.table(true.acrs)
b <- read.table(fake.acrs)

# estimate Tn5 density per 1000bp
a$tn5_density <- a[,ncol(a)]/((a$V3-a$V2)/1000)
b$tn5_density <- b[,ncol(b)]/((b$V3-b$V2)/1000)

# find emprical threshold
threshold <- quantile(b$tn5_density, 1-fdr)

# filter ACRs
a.filtered <- subset(a, a$tn5_density > threshold)
a.filtered$tn5_density <- NULL
a.filtered[,c(1:(ncol(a.filtered)-1))]
write.table(a.filtered, file=output, quote=F, row.names=F, col.names=F, sep="\t")

@plantformatics Thank you for your detailed explanation and codes! I see what you mean. Again, thank you so much.