Error Encountered During Bias Model Training for ATAC-seq Data
Closed this issue · 11 comments
I am working on ATAC-seq mice data and have developed a function to run bias model training across all folds and comparison groups. While testing the function with fold 1 and the group fed_vs_fasted_0hr, I encountered the following error (I attach the details in the end):
IndexError: list index out of range
Debugging Steps: I checked the number of unique chromosomes in the input BAM file, narrowPeak files, and the background file (non-peak regions):
Input BAM file:
samtools view 0hr_Fasted_trimmomatic_intersect_DE_sort.bam | cut -f 3 | sort | uniq | wc -l
Result: 20
narrowPeak files:
cut -f1 0hr_Fasted_trimmomatic_peaks_no_blacklist.narrowPeak | sort | uniq | wc -l
Result: 20
Background file (non-peak regions):
cut -f1 0hr_Fasted_trimmomatic_fold_1_negatives.bed | sort | uniq | wc -l
Result: 21
fold_1.json:
{
"test": [
"chr1",
"chr2",
"chr3",
"chr4"
],
"valid": [
"chr5",
"chr6",
"chr7",
"chr8"
],
"train": [
"chrX",
"chr10",
"chr14",
"chr9",
"chr11",
"chr13",
"chr12",
"chr15",
"chr16",
"chr17",
"chrY",
"chr18",
"chr19"
]
}
Could you please help identify the cause of this error and suggest a possible solution?
Thank you!
Error:
- 2024-08-30 11:56:30.285465: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
- To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
- 2024-08-30 11:56:30.841725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43250 MB memory: -> device: 0, name: NVIDIA L40S, pci bus id: 0000:4a:00.0, compute capability: 8.9
- /.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
- return _methods._mean(a, axis=axis, dtype=dtype,
- /.local/lib/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
- ret = ret.dtype.type(ret / rcount)
- Estimating enzyme shift in input file
- Current estimated shift: +4/-5
- awk -v OFS="\t" '{if ($6=="+"){print $1,$2+0,$3,$4,$5,$6} else if ($6=="-") {print $1,$2,$3+1,$4,$5,$6}}' | sort -k1,1 | bedtools genomecov -bg -5 -i stdin -g /chrombpnet/mm10.chrom.subset.sizes | LC_COLLATE="C" sort -k1,1 -k2,2n
- Making BedGraph (Filter chromosomes not in reference fasta)
- Making Bigwig
- non zero bigwig entries in the given chromosome: 9774
- evaluating hyperparameters on the following chromosomes ['chrX', 'chr10', 'chr14', 'chr9', 'chr11', 'chr13', 'chr12', 'chr15', 'chr16', 'chr17', 'chrY', 'chr18', 'chr19', 'chr5', 'chr6', 'chr7', 'chr8']
- Number of non peaks input: 1786
- Number of non peaks filtered because the input/output is on the edge: 17
- Number of non peaks being used: 1769
- Number of non peaks input: 306
- Number of non peaks filtered because the input/output is on the edge: 0
- Number of non peaks being used: 306
- Number of peaks input: 893
- Number of peaks filtered because the input/output is on the edge: 0
- Number of peaks being used: 893
- Number of peaks input: 306
- Number of peaks filtered because the input/output is on the edge: 0
- Number of peaks being used: 306
- Upper bound counts cut-off for bias model training: 195.408
- Number of nonpeaks after the upper-bount cut-off: 1768
- Number of nonpeaks after applying upper-bound cut-off and removing outliers : 0
- counts_loss_weight: nan
- {'counts_loss_weight': 'nan', 'filters': '128', 'n_dil_layers': '4', 'inputlen': '2114', 'outputlen': '1000', 'max_jitter': '0', 'chr_fold_path': '/chrombpnet/splits/fold_1.json', 'negative_sampling_ratio': '1.0'}
- params:
- filters:128
- n_dil_layers:4
- conv1_kernel_size:21
- profile_kernel_size:75
- counts_loss_weight:nan
- got the model
- loading nonpeaks...
- got split:train for bed regions:(0, 10)
- Traceback (most recent call last):
- File "/ihome/crc/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in
sys.exit(main())
- File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 38, in main
pipelines.train_bias_pipeline(args)
- File "/chrombpnet/chrombpnet/pipelines.py", line 311, in train_bias_pipeline
train.main(args_copy)
- File "/chrombpnet/chrombpnet/training/train.py", line 87, in main
train_generator = initializers.initialize_generators(args, "train", parameters, return_coords=False)
- File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
generator=batchgen_generator.ChromBPNetBatchGenerator(
- File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in init
peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
- File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
- File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
seq = get_seq(peaks_df, genome, input_width)
- File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
return one_hot.dna_to_one_hot(vals)
- File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
seq_len = len(seqs[0])
- IndexError: list index out of range
To add more background to my question.
For the purpose of footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity). I set a minimum overlap of 10%.
Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files.
Here's your text with some adjustments for clarity:
Based on this workflow, I have a few possible reasons for the issues encountered:
The DE-only narrowPeak files and the non-peak regions required for bias model training might be insufficient, leading to an empty data frame.
The DE-only narrowPeak files are not directly generated from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately with the DE-only BAM files.
Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files:
Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training?
Thank you for your quick response!
I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?
Thank you, and I hope you have a great Labor Day weekend!
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in init
self.crop_revcomp_data()
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data
self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio)
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data
nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False)
File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
I also tried using DE-only BAM files, while keeping the narrowpeaks/background input (non-peak regions) derived from all peaks (peak calling on the original BAM files). However, I encountered the following error:
Traceback (most recent call last):
File "/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
sys.exit(main())
File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
pipelines.chrombpnet_train_pipeline(args)
File "/chrombpnet/chrombpnet/pipelines.py", line 37, in chrombpnet_train_pipeline
find_chrombpnet_hyperparams.main(args_copy)
File "/chrombpnet/chrombpnet/helpers/hyperparameters/find_chrombpnet_hyperparams.py", line 138, in main
assert(counts_loss_weight != 0)
AssertionError
The bias model training seems to be correct using the original BAM files and narrowpeaks/background input (without specifically extracting for DE peaks). I can proceed with using this model for bias-free modeling for original BAM files and narrowpeaks/background input. Only when I used DE-only narrowPeak/background and DE-only BAM files, I had errors.
Therefore, what will be a good practice for DE-only footprinting using ChromBpNet?
Thanks,
Best,
Yeque
I need to mention, that after extracting DE peaks from BAM file and NarrowPeak file, I have extreme small amount of peaks in the narrowpeak file and non-peak region files. I believe this is the reason why I had the following error.
<style> </style>DE-only BAM files | 253842 | 124847 | 1548469 | 1074761 | 536346 | 308790 |
---|---|---|---|---|---|---|
DE-only narrowpeak files | 255 | 253 | 1199 | 1170 | 930 | 774 |
Input background files (non-peaks regions) (fold 1) | 442 | 438 | 2092 | 2041 | 1619 | 1358 |
I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?
Thank you, and I hope you have a great Labor Day weekend!
> File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in **init** self.crop_revcomp_data() File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio) File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False) File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice ValueError: Cannot take a larger sample than population when 'replace=False'
Hey, I am trying to decrease NEGATIVE_SAMPLING_RATIO to 0.01 and 0.001 by argument -sr to solve the error. I still want to know if my DE-only BAM/NarrowPeak files have very few peaks (since my DE peaks are few), what is the best practice and choice for me to run ChromBpNet?
Thank you!
Best,
Yeque
Dear ChromBPNet Team,
I discovered that the error may have been caused by having too few peaks in the input. To address this, I made the following changes:
- Reduced the negative sample ratio from the default 0.1 to 0.00001.
- Used p-values instead of adjusted p-values in the DE analysis to increase the number of DE peaks. This adjustment has resulted in more peaks for each sample when running ChromBPNet_model_training (see table below).
Despite these changes, I am now encountering a new error. Could you please provide guidance on how to resolve it? Additionally, could you help me understand the possible reasons behind this error?
Thank you for your assistance.
Best regards,
Yeque
DE-only BAM files | 5870159 | 3316261 | 7996392 | 5812338 | 5270957 | 4592235 |
---|---|---|---|---|---|---|
DE-only narrowpeak files | 10295 | 9990 | 17597 | 17178 | 14901 | 14372 |
Input background files (non-peaks regions) (fold 1) | 17943 | 17402 | 30705 | 29957 | 25972 | 25075 |
Here is the error I encountered now:
Traceback (most recent call last):
File "/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
sys.exit(main())
File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
pipelines.chrombpnet_train_pipeline(args)
File "/chrombpnet/chrombpnet/pipelines.py", line 73, in chrombpnet_train_pipeline
train.main(args_copy)
File "/chrombpnet/chrombpnet/training/train.py", line 88, in main
valid_generator = initializers.initialize_generators(args, "valid", parameters, return_coords=False)
File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
generator=batchgen_generator.ChromBPNetBatchGenerator(
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
seq = get_seq(peaks_df, genome, input_width)
File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
return one_hot.dna_to_one_hot(vals)
File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
seq_len = len(seqs[0])
IndexError: list index out of range
You are right the problem is with few too peaks, please follow the peak generation setup as per the ENCODE ATAC-seq pipeline (you should expect atleast about 100k peaks)
Let me know if you see more issues, Thanks!