kundajelab/chrombpnet

aggressive filtering when generating peaks for bias training

Closed · 1 comment

I am trying to use the bias pipeline to generate peak and non-peak data in order to train a bias model. It was my impression that running find_bias_hyperparams.py would create training, validation, and test peaks and non-peaks to train and evaluate a bias model. I have been trying to train a model on GM12878 from ENCODE, using an input length of 2048. I followed the procedure of converting the ATAC-seq BAM files here and here for GM12878 into a bigwig file using reads_to_bigwig.py, and then used chrombpnet prep nonpeaks to generate GC-matched non-peaks based on the narrowPeak bed file. I then ran find_bias_hyperparams.py, but at the end of the process only 2 of 14 chromosomes still have peaks. The output for fold_0 is as follows:

evaluating hyperparameters on the following chromosomes ['chr2', 'chr4', 'chr5', 'chr7', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr21', 'chr22', 'chrX', 'chrY', 'chr8', 'chr20']
Number of non peaks input:  347484
Number of non peaks filtered because the input/output is on the edge:  304709
Number of non peaks being used:  42775
Number of non peaks input:  50047
Number of non peaks filtered because the input/output is on the edge:  27439
Number of non peaks being used:  22608
Number of peaks input:  173742
Number of peaks filtered because the input/output is on the edge:  151898
Number of peaks being used:  21844
Number of peaks input:  50047
Number of peaks filtered because the input/output is on the edge:  28789
Number of peaks being used:  21258
Upper bound counts cut-off for bias model training:  29.0
Number of nonpeaks after the upper-bount cut-off:  26586
Number of nonpeaks after applying upper-bound cut-off and removing outliers :  25078
counts_loss_weight: 1.6
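
To put the log above in proportion, the counts can be tallied directly. This is only a sketch: the numbers are copied from the fold_0 output, and the labels "first set" and "second set" are placeholders, since the log does not say which split each block belongs to.

```python
# Counts copied from the fold_0 log: (label, regions input, regions filtered "on the edge").
# The "first set"/"second set" labels are placeholders, not names used by the tool.
steps = [
    ("non-peaks, first set", 347484, 304709),
    ("non-peaks, second set", 50047, 27439),
    ("peaks, first set", 173742, 151898),
    ("peaks, second set", 50047, 28789),
]


def edge_filter_summary(n_input, n_filtered):
    """Return (regions kept, fraction filtered) for one filtering step."""
    kept = n_input - n_filtered
    return kept, n_filtered / n_input


summaries = {label: edge_filter_summary(n, f) for label, n, f in steps}

for label, (kept, frac) in summaries.items():
    print(f"{label}: kept {kept}, filtered {frac:.1%}")
```

The kept counts reproduce the "being used" lines in the log (42775, 22608, 21844, 21258). Roughly 87 to 88% of the first-set regions and 55 to 58% of the second-set regions are dropped at this step.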

It seems many peaks are being removed by the 'input/output is on the edge' step. Is this typical behavior in your experience? I am trying to figure out whether I did something wrong on my end. When you train your bias models, I am guessing you have candidate peaks and non-peaks across the genome? Any tips would be greatly appreciated.
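
One way to investigate a question like this is to check independently how many regions would fall "on the edge" for a given window width. The sketch below is not chrombpnet's implementation. It assumes that "on the edge" means a 2048 bp window centred on the region's summit runs past either end of the chromosome, and it separately flags regions on chromosomes missing from the size table. The chromosome names, sizes, and regions are made up for illustration.

```python
from collections import Counter

INPUT_LEN = 2048  # input length mentioned in the post


def classify_region(chrom, start, summit_offset, chrom_sizes, width=INPUT_LEN):
    """Classify one region as 'ok', 'edge', or 'unknown_chrom'.

    Assumption: the window is centred on start + summit_offset and spans
    width // 2 bases on each side.
    """
    if chrom not in chrom_sizes:
        return "unknown_chrom"
    center = start + summit_offset
    win_start = center - width // 2
    win_end = center + width // 2
    if win_start < 0 or win_end > chrom_sizes[chrom]:
        return "edge"
    return "ok"


# Hypothetical chromosome sizes and (chrom, start, summit offset) regions.
chrom_sizes = {"chrA": 10_000}
regions = [
    ("chrA", 100, 50),    # window would start before position 0
    ("chrA", 5000, 120),  # window fits inside the chromosome
    ("chrA", 9500, 10),   # window would run past the chromosome end
    ("chrB", 3000, 40),   # chromosome absent from the size table
]

tally = Counter(classify_region(c, s, o, chrom_sizes) for c, s, o in regions)
print(dict(tally))
```

If a tally like this, run on the real bed file and chromosome sizes, shows far fewer edge regions than the log reports, the inputs (for example, the chromosomes present in the bigwig versus the bed file) are worth comparing.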

I think I figured out the problem. Sorry!