calico/basenji

Issue running basenji_data.py

petnas opened this issue · 2 comments

Hi! First of all, great work on Basenji!

I tried to run basenji_data.py. The provided example data runs successfully, but when I swap the provided .bw file for my own file, it throws an error.

Example:
/data/leuven/345/vsc34527/miniconda3/envs/basenji5/bin/python /vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data.py -s .1 -g data/unmap_macro.bed -l 131072 --local --restart -o data/heart_l131k -p 8 -t .1 -v .1 -w 128 data/hg19.ml.fa data/heart_wigs.txt

My data:
/data/leuven/345/vsc34527/miniconda3/envs/basenji5/bin/python /vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data.py -s .1 -g data/unmap_macro.bed -l 131072 --local --restart -o data/microglia_output -p 8 -t .1 -v .1 -w 128 data/hg19.ml.fa data/microglia.txt

Error:

basenji_data_write.py -s 1679 -e 1858 --umap_clip 1.000000 -x 0 data/hg19.ml.fa data/microglia_output/sequences.bed data/microglia_output/seqs_cov data/microglia_output/tfrecords/test-0.tfr
Traceback (most recent call last):
  File "/vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data_write.py", line 240, in <module>
    main()
  File "/vsc-hard-mounts/leuven-data/345/vsc34527/enformer_training/basenji/bin/basenji_data_write.py", line 106, in main
    seq_pool_len = h5py.File(seqs_cov_files[0], 'r')['targets'].shape[1]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data/leuven/345/vsc34527/miniconda3/envs/basenji5/lib/python3.8/site-packages/h5py/_hl/group.py", line 328, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'targets' doesn't exist)"

The full error is much longer; to save space, this is just the first iteration.

The bw file is from GEO and is based on hg38. Is it the case that basenji_data.py can only use bw files made from bed files using your bam_cov.py?

I was wondering if anyone ever encountered a similar issue or is it just me making a silly mistake somewhere...

Thank you,
Petras

Hi!

I'm not sure if this will help you, but I'm working on the same thing right now and I managed to use .bw files from ENCODE without using the bam_cov.py script. In my output, the command for the basenji_data_write.py call looks like this:

basenji/bin/basenji_data_write.py -s 20480 -e 20736 --umap_clip 1.000000 -x 0 genomes/hg38.ml.fa data/basenji_preprocess/output_tfr/sequences.bed data/basenji_preprocess/output_tfr/seqs_cov data/basenji_preprocess/output_tfr/tfrecords/train-80.tfr

I added a print statement to get this line (around line 440 in basenji_data.py), so it might differ because of that. But in your error, it looks like the last 3 options are somehow missing from your call to the write script?

Not sure if it helps & otherwise no worries!

Hi Petras, most likely the problem is that your BigWig is hg38 and your FASTA is hg19. Try again with an hg38 FASTA. You'll also want to drop the blacklist and unmappable regions, or replace them with hg38 versions.
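One way to confirm an assembly mismatch like this before rerunning is to compare chromosome lengths between the FASTA index and the BigWig header. The sketch below is only an illustration and not part of Basenji: the helper names `read_fai` and `assembly_mismatches` are hypothetical, and it assumes you have a `.fai` index next to the FASTA (for example from `samtools faidx`) and can get a `{chrom: length}` dict from the BigWig (pyBigWig's `chroms()` returns one). The chr1 lengths in the demo are the published hg19 and hg38 values.

```python
from typing import Dict, List


def read_fai(path: str) -> Dict[str, int]:
    """Parse a samtools-style .fai index into {chrom: length}.

    Each line is tab-separated; the first two columns are the
    sequence name and its length in bases.
    """
    sizes = {}
    with open(path) as fai:
        for line in fai:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                sizes[fields[0]] = int(fields[1])
    return sizes


def assembly_mismatches(fasta_sizes: Dict[str, int],
                        bigwig_sizes: Dict[str, int]) -> List[str]:
    """List chromosomes whose BigWig length disagrees with the FASTA.

    Hypothetical helper: bigwig_sizes is assumed to come from something
    like pyBigWig.open(path).chroms().
    """
    problems = []
    for chrom, bw_len in sorted(bigwig_sizes.items()):
        fa_len = fasta_sizes.get(chrom)
        if fa_len is None:
            problems.append(f"{chrom}: in BigWig but not in FASTA")
        elif fa_len != bw_len:
            problems.append(
                f"{chrom}: FASTA length {fa_len} != BigWig length {bw_len}"
            )
    return problems


if __name__ == "__main__":
    # chr1 is 249250621 bp in hg19 and 248956422 bp in hg38.
    hg19_fasta = {"chr1": 249250621}
    hg38_bigwig = {"chr1": 248956422}
    for problem in assembly_mismatches(hg19_fasta, hg38_bigwig):
        print(problem)
```

An empty list means every BigWig chromosome is present in the FASTA with the same length. Any reported line suggests the two files come from different assemblies or use different chromosome naming.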