GuanLab/Leopard

Preprocessing

Hello,

Thank you for this model. I am trying to replicate it to predict TF binding in Drosophila. Could you please provide details on how to preprocess the sequence and the DNase or ATAC files when using a genome other than human (e.g., Drosophila or mouse)?

Thank you.

Regards,
Gunjan

Hi Gunjan,

I've updated the script for one-hot encoding DNA sequences. Before using it, you need to:

  • Download the Drosophila fasta files and save them in an input directory. For example, line 55 of my script sets path1='./grch37/', which is where I save the GRCh37 fasta files of the human genome. The fasta files are named chr1.fa, chr2.fa, ..., chr21.fa, chr22.fa, chrX.fa.
  • Modify the lengths of chromosomes for Drosophila in lines 51-52.
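To make the two steps above concrete, here is a minimal sketch of reading fasta records, deriving the chromosome lengths from them, and one-hot encoding a sequence. It is not the Leopard script itself: the function names `read_fasta` and `one_hot`, the A/C/G/T row order, the all-zero encoding of N, and the toy records are assumptions made for illustration, so check them against the actual script before relying on them.

```python
import io

import numpy as np


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open fasta file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the chromosome name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def one_hot(seq):
    """Return a 4 x len(seq) array with rows A, C, G, T (assumed order).

    Any other character, such as N, is left as an all-zero column.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        row = index.get(base)
        if row is not None:
            out[row, pos] = 1.0
    return out


# Toy in-memory fasta standing in for real per-chromosome files.
toy = io.StringIO(">chr2L toy record\nACGT\nNNac\n>chrX\nGGT\n")
records = dict(read_fasta(toy))

# Chromosome lengths computed from the sequences rather than typed by hand.
chrom_lengths = {name: len(seq) for name, seq in records.items()}
encoded = one_hot(records["chr2L"])
print(chrom_lengths, encoded.shape)
```

Computing the lengths from the fasta files themselves avoids a mismatch between the hard-coded lengths and the genome build actually downloaded.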

Regarding the preprocessing of DNase/ATAC files, you can follow the instructions here:
https://github.com/GuanLab/Leopard#quantile-normalization-for-new-data
You can select one specific cell line as the reference for the quantile normalization. Again, the lengths of chromosomes should be modified for Drosophila in the related scripts. Later on, when you train models and make predictions, you need to make the same kind of changes in the train.py and predict.py scripts. Be sure to use the same reference genome for both the DNA sequence and the DNase/ATAC files.
If your DNase/ATAC files are not in bigwig format, it's better to convert them into bigwig format first.
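As a rough illustration of what quantile normalization against a reference cell line does, here is a generic sketch that maps each value of a signal onto the value at the same quantile of the reference distribution. It is not the Leopard normalization script (follow the linked README for that); the function name `quantile_normalize_to_reference` and the toy arrays are hypothetical, and the signals are assumed to be already loaded as plain numeric arrays.

```python
import numpy as np


def quantile_normalize_to_reference(signal, reference):
    """Map each value in `signal` to the reference value at the same quantile."""
    signal = np.asarray(signal, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Rank every position of the signal (0 = smallest value).
    # Ties are broken by position here; a real pipeline may prefer to average them.
    order = np.argsort(signal, kind="stable")
    ranks = np.empty(len(signal), dtype=float)
    ranks[order] = np.arange(len(signal))

    # Convert ranks to quantiles in [0, 1].
    quantiles = ranks / max(len(signal) - 1, 1)

    # Look up those quantiles in the sorted reference distribution.
    ref_sorted = np.sort(reference)
    ref_quantiles = np.linspace(0.0, 1.0, len(ref_sorted))
    return np.interp(quantiles, ref_quantiles, ref_sorted)


# Toy example: the new signal takes on the value distribution of the reference.
new_signal = [10.0, 0.0, 5.0]
reference_signal = [1.0, 2.0, 3.0]
normalized = quantile_normalize_to_reference(new_signal, reference_signal)
print(normalized)
```

The ordering of positions within the new signal is preserved; only the value distribution is replaced by that of the reference.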

Thank you,
Hongyang

Thank you so much for the clarification. I really appreciate it.

Regards,
Gunjan