kundajelab/chrombpnet

Model performance: Pearson correlation score

arodel21 opened this issue · 9 comments

Hello again

I hope you have all enjoyed the winter break.

I wanted to ask for advice on how to improve model performance, specifically ChromBPNet's Pearson correlation in peaks. The overall report shows a pearsonr of 0.197, well below the 0.5 threshold that a well-performing model should exceed. Do you have any thoughts on what could cause the correlation to be this low?
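For context, pearsonr is a Pearson correlation between observed and predicted values across regions. A minimal sketch of that calculation follows, assuming a hypothetical two-column, tab-separated file (observed, then predicted, one region per line). The helper name `pearson_r` is made up for illustration, and whether the report computes this on raw or log counts is not confirmed here.

```shell
# Hypothetical helper (not part of chrombpnet): Pearson correlation of
# column 1 (observed) against column 2 (predicted) in a tab-separated file.
pearson_r() {
  awk -F'\t' '
    # Accumulate the sums needed for the single-pass Pearson formula.
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      num = n * sxy - sx * sy
      den = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
      # Undefined with fewer than two rows or a constant column.
      if (n < 2 || den == 0) { print "NA"; exit }
      printf "%.3f\n", num / den
    }
  ' "$1"
}

# Usage (file name is a placeholder):
# pearson_r observed_vs_predicted.tsv
```

A value near 0.2 means the predicted values barely track the observed ones across peaks, which usually points at the inputs rather than at training hyperparameters.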

Thanks in advance.

Can you share the html reports?

Here it goes!

overall_report.pdf

Thank you!

Hmm...something doesn't look right.

(1) Can you share a browser session with the observed bigwig and peaks?
(2) Can you share some stats about your input data: read depth, fraction of reads in peaks, and, if you have multiple replicates, some concordance metrics?
(3) What is the content of your peaks? Can you share a few rows of your peak file?

Also, for completeness, please share the commands you are using to train the models.
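For item (2), read depth and fraction of reads in peaks (FRiP) can be estimated roughly as sketched below. This assumes samtools is installed. The function names are hypothetical, and the flag filter (drop unmapped, secondary, and supplementary alignments) is one reasonable choice rather than a prescribed one.

```shell
# Hypothetical helpers (not part of chrombpnet); they assume samtools is on PATH.

# Read depth: count alignments, excluding unmapped (4), secondary (256)
# and supplementary (2048) records, i.e. -F 2308.
count_reads() {
  samtools view -c -F 2308 "$1"
}

# Same count, restricted to alignments overlapping regions in a BED file.
count_reads_in_peaks() {
  samtools view -c -F 2308 -L "$2" "$1"
}

# Fraction of reads in peaks from the two counts above.
frip() {
  awk -v in_peaks="$1" -v total="$2" 'BEGIN {
    if (total > 0) printf "%.4f\n", in_peaks / total
    else print "NA"
  }'
}

# Usage (paths are placeholders):
# total=$(count_reads "$bam")
# in_peaks=$(count_reads_in_peaks "$bam" peaks_no_blacklist.bed)
# frip "$in_peaks" "$total"
```

A very low FRiP would suggest that the peak set and the reads do not describe the same signal.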

Thanks for the reply.

Just to confirm, given the parameters passed to the chrombpnet pipeline command: by "observed bigwig", do you mean the bigwig generated from the -ibam file?

After you confirm I will get you the stats.

I have been wondering whether the low correlation might be due to a mismatch between the reads and the peaks. The peaks are genomic regions specific to one cell type, while the reads come from multiple cell types, including the one used for the peaks. Do you think this could influence the Pearson correlation score?

The commands I used are

  • For creating nonpeaks:
    bedtools slop -i $blacklist -g $chrom_sizes -b 1057 > temp.bed
    bedtools intersect -v -a $peaks -b temp.bed > peaks_no_blacklist.bed
    chrombpnet prep nonpeaks -g $genome -p peaks_no_blacklist.bed -c $chrom_sizes -fl $splits -br $blacklist -o output -il 2114

Where genome is GRCz11.fa.

  • For training the bias model:
    chrombpnet bias pipeline -ibam $bam -d "ATAC" -g $genome -c $chrom_sizes -p peaks_no_blacklist.bed -n output_negatives.bed -fl $fold -b 0.5 -o bias_model/ -fp k562

  • For training ChromBPNet model:
    chrombpnet pipeline -ibam $bam -d "ATAC" -g $genome -c $chrom_sizes -p peaks_no_blacklist.bed -n output_negatives.bed -fl $fold -b bias_model/models/k562_bias.h5 -o $output

The peaks are zebrafish embryo enhancer regions for a specific cell type. Here are some of them:

chr1	5410234	5410901	.	.	.	.	.	.	333
chr1	7893549	7894421	.	.	.	.	.	.	436
chr1	59180459	59181068	.	.	.	.	.	.	304
chr2	40107174	40107673	.	.	.	.	.	.	249
chr4	129992	130402	.	.	.	.	.	.	205
chr4	5254977	5255424	.	.	.	.	.	.	223
chr4	8496741	8497364	.	.	.	.	.	.	311
chr7	29864191	29864685	.	.	.	.	.	.	247
chr7	43838799	43839225	.	.	.	.	.	.	213
chr8	30442994	30443692	.	.	.	.	.	.	349
chr8	38527582	38528079	.	.	.	.	.	.	248
chr9	29665717	29666279	.	.	.	.	.	.	281
chr9	42978827	42979288	.	.	.	.	.	.	230
chr14	33296076	33297494	.	.	.	.	.	.	709
chr15	2799590	2800235	.	.	.	.	.	.	322
chr15	9888201	9888589	.	.	.	.	.	.	194
chr15	31109078	31109710	.	.	.	.	.	.	316
chr19	18991133	18991665	.	.	.	.	.	.	266
chr23	23042932	23043267	.	.	.	.	.	.	167
chr25	2641708	2642175	.	.	.	.	.	.	233
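A quick sanity check of a peak file like the one above can be sketched as follows. It assumes the file is 10-column, narrowPeak-style, with column 10 holding the summit as a 0-based offset from the start, and that a chrom.sizes file is available. The helper name `check_peaks` and the set of checks are illustrative, not something chrombpnet itself runs.

```shell
# Hypothetical helper (not part of chrombpnet): sanity-check a 10-column,
# narrowPeak-style peak file against a chrom.sizes file.
# Usage: check_peaks peaks.bed chrom.sizes
check_peaks() {
  awk -F'\t' '
    # First file read: chrom.sizes (name <TAB> length).
    NR == FNR { size[$1] = $2 + 0; next }
    # Second file read: peaks.
    {
      n++
      if (NF != 10)      { bad_cols++; next }   # wrong column count
      if (!($1 in size)) { bad_chrom++; next }  # chromosome missing from chrom.sizes
      s = $2 + 0; e = $3 + 0; m = $10 + 0
      if (s < 0 || e <= s || e > size[$1]) { bad_coords++; next }
      if (m < 0 || m >= e - s)             { bad_summit++; next }  # summit outside peak
      ok++
    }
    END {
      printf "peaks=%d ok=%d bad_cols=%d unknown_chrom=%d bad_coords=%d bad_summit=%d\n",
             n, ok, bad_cols, bad_chrom, bad_coords, bad_summit
    }
  ' "$2" "$1"
}

# Usage (paths are placeholders):
# check_peaks peaks_no_blacklist.bed "$chrom_sizes"
```

Among other things, a check like this can surface a chromosome-naming mismatch (for example chr1 versus 1) between the peak file and the chrom.sizes file.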

When you say "The peaks are genomic regions specific to a cell type, while the reads contain multiple cell types", do you mean that you are merging reads across multiple cell types (which ones?) while the peaks themselves are specific to one cell type (again, which one)?

What is the goal of your model?

Closing this due to inactivity. Feel free to reopen it if you continue to see issues.