Error Encountered During Bias Model Training for ATAC-seq Data

Question

Error Encountered During Bias Model Training for ATAC-seq Data

Closed this issue 10 days ago · 11 comments

I am working on ATAC-seq mice data and have developed a function to run bias model training across all folds and comparison groups. While testing the function with fold 1 and the group fed_vs_fasted_0hr, I encountered the following error (I attach the details in the end):
IndexError: list index out of range

Debugging Steps: I checked the number of unique chromosomes in the input BAM file, narrowPeak files, and the background file (non-peak regions):

Input BAM file:

samtools view 0hr_Fasted_trimmomatic_intersect_DE_sort.bam | cut -f 3 | sort | uniq | wc -l
Result: 20

narrowPeak files:

cut -f1 0hr_Fasted_trimmomatic_peaks_no_blacklist.narrowPeak | sort | uniq | wc -l
Result: 20

Background file (non-peak regions):

cut -f1 0hr_Fasted_trimmomatic_fold_1_negatives.bed | sort | uniq | wc -l
Result: 21

fold_1.json:
{
"test": [
"chr1",
"chr2",
"chr3",
"chr4"
],
"valid": [
"chr5",
"chr6",
"chr7",
"chr8"
],
"train": [
"chrX",
"chr10",
"chr14",
"chr9",
"chr11",
"chr13",
"chr12",
"chr15",
"chr16",
"chr17",
"chrY",
"chr18",
"chr19"
]
}
Could you please help identify the cause of this error and suggest a possible solution?

Thank you!

Error:

2024-08-30 11:56:30.285465: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2024-08-30 11:56:30.841725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43250 MB memory: -> device: 0, name: NVIDIA L40S, pci bus id: 0000:4a:00.0, compute capability: 8.9

/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.

return _methods._mean(a, axis=axis, dtype=dtype,

/.local/lib/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars

ret = ret.dtype.type(ret / rcount)

Estimating enzyme shift in input file

Current estimated shift: +4/-5

awk -v OFS="\t" '{if ($6=="+"){print $1,$2+0,$3,$4,$5,$6} else if ($6=="-") {print $1,$2,$3+1,$4,$5,$6}}' | sort -k1,1 | bedtools genomecov -bg -5 -i stdin -g /chrombpnet/mm10.chrom.subset.sizes | LC_COLLATE="C" sort -k1,1 -k2,2n

Making BedGraph (Filter chromosomes not in reference fasta)

Making Bigwig

non zero bigwig entries in the given chromosome: 9774

evaluating hyperparameters on the following chromosomes ['chrX', 'chr10', 'chr14', 'chr9', 'chr11', 'chr13', 'chr12', 'chr15', 'chr16', 'chr17', 'chrY', 'chr18', 'chr19', 'chr5', 'chr6', 'chr7', 'chr8']

Number of non peaks input: 1786

Number of non peaks filtered because the input/output is on the edge: 17

Number of non peaks being used: 1769

Number of non peaks input: 306

Number of non peaks filtered because the input/output is on the edge: 0

Number of non peaks being used: 306

Number of peaks input: 893

Number of peaks filtered because the input/output is on the edge: 0

Number of peaks being used: 893

Number of peaks input: 306

Number of peaks filtered because the input/output is on the edge: 0

Number of peaks being used: 306

Upper bound counts cut-off for bias model training: 195.408

Number of nonpeaks after the upper-bount cut-off: 1768

Number of nonpeaks after applying upper-bound cut-off and removing outliers : 0

counts_loss_weight: nan

{'counts_loss_weight': 'nan', 'filters': '128', 'n_dil_layers': '4', 'inputlen': '2114', 'outputlen': '1000', 'max_jitter': '0', 'chr_fold_path': '/chrombpnet/splits/fold_1.json', 'negative_sampling_ratio': '1.0'}

params:

filters:128

n_dil_layers:4

conv1_kernel_size:21

profile_kernel_size:75

counts_loss_weight:nan

got the model

loading nonpeaks...

got split:train for bed regions:(0, 10)

Traceback (most recent call last):

File "/ihome/crc/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in
sys.exit(main())
File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 38, in main
pipelines.train_bias_pipeline(args)
File "/chrombpnet/chrombpnet/pipelines.py", line 311, in train_bias_pipeline
train.main(args_copy)
File "/chrombpnet/chrombpnet/training/train.py", line 87, in main
train_generator = initializers.initialize_generators(args, "train", parameters, return_coords=False)
File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
generator=batchgen_generator.ChromBPNetBatchGenerator(
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in init
peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
seq = get_seq(peaks_df, genome, input_width)
File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
return one_hot.dna_to_one_hot(vals)
File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
seq_len = len(seqs[0])
IndexError: list index out of range

Answer 1 · 2024-08-30T17:29:26.000Z

To add more background to my question.
For the purpose of footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity). I set a minimum overlap of 10%.

Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files.

Here is the workflow:

Here's your text with some adjustments for clarity:

Based on this workflow, I have a few possible reasons for the issues encountered:

The DE-only narrowPeak files and the non-peak regions required for bias model training might be insufficient, leading to an empty data frame.
The DE-only narrowPeak files are not directly generated from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately with the DE-only BAM files.
Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files:

Answer 2 · 2024-08-30T17:34:59.000Z

Please train the bias model as we have instructed. There is no need to train it based on DE peaks or their exclusion only. You can focus model training on whatever peaks u want. Bias model should be trained as we have instructed.

…

On Fri, Aug 30, 2024 at 10:29 AM ys3508 ***@***.***> wrote: To add more background to my question. For the purpose of footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity). I set a minimum overlap of 10%. Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files. Here is the workflow: Screen.Shot.2024-08-30.at.1.21.43.PM.png (view on web) <https://github.com/user-attachments/assets/a93799c0-095e-4dba-a908-47403891b63c> Here's your text with some adjustments for clarity: Based on this workflow, I have a few possible reasons for the issues encountered: The DE-only narrowPeak files and the non-peak regions required for bias model training might be insufficient, leading to an empty data frame. The DE-only narrowPeak files are not directly generated from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately with the DE-only BAM files. Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files: Screen.Shot.2024-08-30.at.1.29.12.PM.png (view on web) <https://github.com/user-attachments/assets/31b296bb-8b3e-4942-bcee-4385fb2fb9cb> — Reply to this email directly, view it on GitHub <#205 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABDWEIOKALMXYCQ6YQ34YTZUCTYZAVCNFSM6AAAAABNMZAZPOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGAZDSNZYGU> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Answer 3 · 2024-08-30T17:53:58.000Z

Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training?
Thank you for your quick response!

Answer 4 · 2024-08-30T19:44:00.000Z

Yes correct

…

On Fri, Aug 30, 2024 at 10:54 AM ys3508 ***@***.***> wrote: Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training? Thank you for your quick response! — Reply to this email directly, view it on GitHub <#205 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABDWELMBXDPOPYK2CVWPJLZUCWUZAVCNFSM6AAAAABNMZAZPOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGA3DIOJUG4> . You are receiving this because you commented.Message ID: ***@***.***>

Answer 5 · 2024-09-01T15:33:03.000Z

I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?

Thank you, and I hope you have a great Labor Day weekend!

File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in init
self.crop_revcomp_data()
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data
self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio)
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data
nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False)
File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

Answer 6 · 2024-09-04T14:57:43.000Z

I also tried using DE-only BAM files, while keeping the narrowpeaks/background input (non-peak regions) derived from all peaks (peak calling on the original BAM files). However, I encountered the following error:

Traceback (most recent call last):
  File "/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
    pipelines.chrombpnet_train_pipeline(args)
  File "/chrombpnet/chrombpnet/pipelines.py", line 37, in chrombpnet_train_pipeline
    find_chrombpnet_hyperparams.main(args_copy)
  File "/chrombpnet/chrombpnet/helpers/hyperparameters/find_chrombpnet_hyperparams.py", line 138, in main
    assert(counts_loss_weight != 0)
AssertionError

The bias model training seems to be correct using the original BAM files and narrowpeaks/background input (without specifically extracting for DE peaks). I can proceed with using this model for bias-free modeling for original BAM files and narrowpeaks/background input. Only when I used DE-only narrowPeak/background and DE-only BAM files, I had errors.

Therefore, what will be a good practice for DE-only footprinting using ChromBpNet?

Thanks,

Best,
Yeque

Answer 7 · 2024-09-04T17:23:31.000Z

I need to mention, that after extracting DE peaks from BAM file and NarrowPeak file, I have extreme small amount of peaks in the narrowpeak file and non-peak region files. I believe this is the reason why I had the following error.

DE-only BAM files	253842	124847	1548469	1074761	536346	308790
DE-only narrowpeak files	255	253	1199	1170	930	774
Input background files (non-peaks regions) (fold 1)	442	438	2092	2041	1619	1358

I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?

Thank you, and I hope you have a great Labor Day weekend!

> File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in **init** self.crop_revcomp_data() File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio) File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False) File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice ValueError: Cannot take a larger sample than population when 'replace=False'

Answer 8 · 2024-09-06T14:26:49.000Z

Hey, I am trying to decrease NEGATIVE_SAMPLING_RATIO to 0.01 and 0.001 by argument -sr to solve the error. I still want to know if my DE-only BAM/NarrowPeak files have very few peaks (since my DE peaks are few), what is the best practice and choice for me to run ChromBpNet?

Thank you!

Best,
Yeque

Answer 9 · 2024-09-13T20:10:59.000Z

Dear ChromBPNet Team,

I discovered that the error may have been caused by having too few peaks in the input. To address this, I made the following changes:

Reduced the negative sample ratio from the default 0.1 to 0.00001.
Used p-values instead of adjusted p-values in the DE analysis to increase the number of DE peaks. This adjustment has resulted in more peaks for each sample when running ChromBPNet_model_training (see table below).

Despite these changes, I am now encountering a new error. Could you please provide guidance on how to resolve it? Additionally, could you help me understand the possible reasons behind this error?

Thank you for your assistance.

Best regards,
Yeque

DE-only BAM files	5870159	3316261	7996392	5812338	5270957	4592235
DE-only narrowpeak files	10295	9990	17597	17178	14901	14372
Input background files (non-peaks regions) (fold 1)	17943	17402	30705	29957	25972	25075

Here is the error I encountered now:

Traceback (most recent call last):
  File "/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
    pipelines.chrombpnet_train_pipeline(args)
  File "/chrombpnet/chrombpnet/pipelines.py", line 73, in chrombpnet_train_pipeline
    train.main(args_copy)
  File "/chrombpnet/chrombpnet/training/train.py", line 88, in main
    valid_generator = initializers.initialize_generators(args, "valid", parameters, return_coords=False)
  File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
    generator=batchgen_generator.ChromBPNetBatchGenerator(
  File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
    peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
    train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
    seq = get_seq(peaks_df, genome, input_width)
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
    return one_hot.dna_to_one_hot(vals)
  File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
    seq_len = len(seqs[0])
IndexError: list index out of range

Answer 10 · 2024-12-18T18:14:46.000Z

You are right the problem is with few too peaks, please follow the peak generation setup as per the ENCODE ATAC-seq pipeline (you should expect atleast about 100k peaks)

Answer 11 · 2024-12-18T18:19:09.000Z

Let me know if you see more issues, Thanks!