kundajelab/chrombpnet

Error Encountered During Bias Model Training for ATAC-seq Data

Closed this issue · 11 comments

I am working on ATAC-seq mice data and have developed a function to run bias model training across all folds and comparison groups. While testing the function with fold 1 and the group fed_vs_fasted_0hr, I encountered the following error (I attach the details in the end):
IndexError: list index out of range

Debugging Steps: I checked the number of unique chromosomes in the input BAM file, narrowPeak files, and the background file (non-peak regions):

Input BAM file:

samtools view 0hr_Fasted_trimmomatic_intersect_DE_sort.bam | cut -f 3 | sort | uniq | wc -l
Result: 20

narrowPeak files:

cut -f1 0hr_Fasted_trimmomatic_peaks_no_blacklist.narrowPeak | sort | uniq | wc -l
Result: 20

Background file (non-peak regions):

cut -f1 0hr_Fasted_trimmomatic_fold_1_negatives.bed | sort | uniq | wc -l
Result: 21

fold_1.json:
{
"test": [
"chr1",
"chr2",
"chr3",
"chr4"
],
"valid": [
"chr5",
"chr6",
"chr7",
"chr8"
],
"train": [
"chrX",
"chr10",
"chr14",
"chr9",
"chr11",
"chr13",
"chr12",
"chr15",
"chr16",
"chr17",
"chrY",
"chr18",
"chr19"
]
}
Could you please help identify the cause of this error and suggest a possible solution?

Thank you!

Error:

  • 2024-08-30 11:56:30.285465: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
  • To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  • 2024-08-30 11:56:30.841725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43250 MB memory: -> device: 0, name: NVIDIA L40S, pci bus id: 0000:4a:00.0, compute capability: 8.9
  • /.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  • return _methods._mean(a, axis=axis, dtype=dtype,
  • /.local/lib/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  • ret = ret.dtype.type(ret / rcount)
  • Estimating enzyme shift in input file
  • Current estimated shift: +4/-5
  • awk -v OFS="\t" '{if ($6=="+"){print $1,$2+0,$3,$4,$5,$6} else if ($6=="-") {print $1,$2,$3+1,$4,$5,$6}}' | sort -k1,1 | bedtools genomecov -bg -5 -i stdin -g /chrombpnet/mm10.chrom.subset.sizes | LC_COLLATE="C" sort -k1,1 -k2,2n
  • Making BedGraph (Filter chromosomes not in reference fasta)
  • Making Bigwig
  • non zero bigwig entries in the given chromosome: 9774
  • evaluating hyperparameters on the following chromosomes ['chrX', 'chr10', 'chr14', 'chr9', 'chr11', 'chr13', 'chr12', 'chr15', 'chr16', 'chr17', 'chrY', 'chr18', 'chr19', 'chr5', 'chr6', 'chr7', 'chr8']
  • Number of non peaks input: 1786
  • Number of non peaks filtered because the input/output is on the edge: 17
  • Number of non peaks being used: 1769
  • Number of non peaks input: 306
  • Number of non peaks filtered because the input/output is on the edge: 0
  • Number of non peaks being used: 306
  • Number of peaks input: 893
  • Number of peaks filtered because the input/output is on the edge: 0
  • Number of peaks being used: 893
  • Number of peaks input: 306
  • Number of peaks filtered because the input/output is on the edge: 0
  • Number of peaks being used: 306
  • Upper bound counts cut-off for bias model training: 195.408
  • Number of nonpeaks after the upper-bount cut-off: 1768
  • Number of nonpeaks after applying upper-bound cut-off and removing outliers : 0
  • counts_loss_weight: nan
  • {'counts_loss_weight': 'nan', 'filters': '128', 'n_dil_layers': '4', 'inputlen': '2114', 'outputlen': '1000', 'max_jitter': '0', 'chr_fold_path': '/chrombpnet/splits/fold_1.json', 'negative_sampling_ratio': '1.0'}
  • params:
  • filters:128
  • n_dil_layers:4
  • conv1_kernel_size:21
  • profile_kernel_size:75
  • counts_loss_weight:nan
  • got the model
  • loading nonpeaks...
  • got split:train for bed regions:(0, 10)
  • Traceback (most recent call last):
  • File "/ihome/crc/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in
  • sys.exit(main())
    
  • File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 38, in main
  • pipelines.train_bias_pipeline(args)
    
  • File "/chrombpnet/chrombpnet/pipelines.py", line 311, in train_bias_pipeline
  • train.main(args_copy)
    
  • File "/chrombpnet/chrombpnet/training/train.py", line 87, in main
  • train_generator = initializers.initialize_generators(args, "train", parameters, return_coords=False)
    
  • File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
  • generator=batchgen_generator.ChromBPNetBatchGenerator(
    
  • File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in init
  • peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
    
  • File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
  • train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
    
  • File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
  • seq = get_seq(peaks_df, genome, input_width)
    
  • File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
  • return one_hot.dna_to_one_hot(vals)
    
  • File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
  • seq_len = len(seqs[0])
    
  • IndexError: list index out of range

To add more background to my question.
For the purpose of footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity). I set a minimum overlap of 10%.

Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files.

Here is the workflow:
Screen Shot 2024-08-30 at 1 21 43 PM

Here's your text with some adjustments for clarity:

Based on this workflow, I have a few possible reasons for the issues encountered:

The DE-only narrowPeak files and the non-peak regions required for bias model training might be insufficient, leading to an empty data frame.
The DE-only narrowPeak files are not directly generated from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately with the DE-only BAM files.
Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files:

Screen Shot 2024-08-30 at 1 29 12 PM

Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training?
Thank you for your quick response!

I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?

Thank you, and I hope you have a great Labor Day weekend!

File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in init
self.crop_revcomp_data()
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data
self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio)
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data
nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False)
File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

I also tried using DE-only BAM files, while keeping the narrowpeaks/background input (non-peak regions) derived from all peaks (peak calling on the original BAM files). However, I encountered the following error:

Traceback (most recent call last):
  File "/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
    pipelines.chrombpnet_train_pipeline(args)
  File "/chrombpnet/chrombpnet/pipelines.py", line 37, in chrombpnet_train_pipeline
    find_chrombpnet_hyperparams.main(args_copy)
  File "/chrombpnet/chrombpnet/helpers/hyperparameters/find_chrombpnet_hyperparams.py", line 138, in main
    assert(counts_loss_weight != 0)
AssertionError

The bias model training seems to be correct using the original BAM files and narrowpeaks/background input (without specifically extracting for DE peaks). I can proceed with using this model for bias-free modeling for original BAM files and narrowpeaks/background input. Only when I used DE-only narrowPeak/background and DE-only BAM files, I had errors.

Therefore, what will be a good practice for DE-only footprinting using ChromBpNet?

Thanks,

Best,
Yeque

I need to mention, that after extracting DE peaks from BAM file and NarrowPeak file, I have extreme small amount of peaks in the narrowpeak file and non-peak region files. I believe this is the reason why I had the following error.

<style> </style>
DE-only BAM files 253842 124847 1548469 1074761 536346 308790
DE-only narrowpeak files 255 253 1199 1170 930 774
Input background files (non-peaks regions) (fold 1) 442 438 2092 2041 1619 1358

I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?

Thank you, and I hope you have a great Labor Day weekend!

> File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in **init** self.crop_revcomp_data() File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio) File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False) File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice ValueError: Cannot take a larger sample than population when 'replace=False'

Hey, I am trying to decrease NEGATIVE_SAMPLING_RATIO to 0.01 and 0.001 by argument -sr to solve the error. I still want to know if my DE-only BAM/NarrowPeak files have very few peaks (since my DE peaks are few), what is the best practice and choice for me to run ChromBpNet?

Thank you!

Best,
Yeque

Dear ChromBPNet Team,

I discovered that the error may have been caused by having too few peaks in the input. To address this, I made the following changes:

  1. Reduced the negative sample ratio from the default 0.1 to 0.00001.
  2. Used p-values instead of adjusted p-values in the DE analysis to increase the number of DE peaks. This adjustment has resulted in more peaks for each sample when running ChromBPNet_model_training (see table below).

Despite these changes, I am now encountering a new error. Could you please provide guidance on how to resolve it? Additionally, could you help me understand the possible reasons behind this error?

Thank you for your assistance.

Best regards,
Yeque

<style> </style>
DE-only BAM files 5870159 3316261 7996392 5812338 5270957 4592235
DE-only narrowpeak files 10295 9990 17597 17178 14901 14372
Input background files (non-peaks regions) (fold 1) 17943 17402 30705 29957 25972 25075

Here is the error I encountered now:

Traceback (most recent call last):
  File "/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
    pipelines.chrombpnet_train_pipeline(args)
  File "/chrombpnet/chrombpnet/pipelines.py", line 73, in chrombpnet_train_pipeline
    train.main(args_copy)
  File "/chrombpnet/chrombpnet/training/train.py", line 88, in main
    valid_generator = initializers.initialize_generators(args, "valid", parameters, return_coords=False)
  File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
    generator=batchgen_generator.ChromBPNetBatchGenerator(
  File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
    peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
    train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
    seq = get_seq(peaks_df, genome, input_width)
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
    return one_hot.dna_to_one_hot(vals)
  File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
    seq_len = len(seqs[0])
IndexError: list index out of range

You are right the problem is with few too peaks, please follow the peak generation setup as per the ENCODE ATAC-seq pipeline (you should expect atleast about 100k peaks)

Let me know if you see more issues, Thanks!