[Bug] `TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'` when running `write_tfrecord_species.py`
jolespin opened this issue · 3 comments
I'd like to train a model on some genomes I have available locally. My gene models are in GFF format, so I converted them to GTF with `gffread`, extracted the longest isoform, and then tried creating tfrecords, but the script failed.
I'm attaching the input files in case they're useful:
$ python Tiberius/bin/write_tfrecord_species.py --fasta ${SPECIES}.fa --gtf ${SPECIES}.longest.gtf --out tfrecords/${SPECIES}
2024-10-07 21:43:22.216480: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 371, in <module>
main()
File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 309, in main
fasta, ref = get_species_data_hmm(genome_path=args.fasta, annot_path=args.gtf,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 119, in get_species_data_hmm
f_chunk = fasta.get_flat_chunks(strand='+', pad=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/genome_fasta.py", line 143, in get_flat_chunks
// (self.chunksize - self.overlap) + 1
~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
Hi,
thanks for using Tiberius!
The script requires the `--wsize` argument, which specifies the sequence size of each training example. For example, we used `--wsize 9999` for training on the mammalian genomes.
I seem to have forgotten to include it in the documentation, I'm sorry.
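For context, here is a minimal sketch of why a missing `--wsize` surfaces as this particular `TypeError`, and how a required argument would catch it earlier. This is not the Tiberius code: `unguarded_num_chunks`, `num_chunks`, `build_parser`, and the numerator of the chunk formula are assumptions made for illustration. Only the `chunksize - overlap` denominator is taken from the traceback.

```python
import argparse


def unguarded_num_chunks(length, chunksize, overlap=0):
    # Hypothetical formula; only the (chunksize - overlap) denominator
    # mirrors the traceback. With chunksize=None the subtraction raises
    # TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'.
    return (length - overlap) // (chunksize - overlap) + 1


def num_chunks(length, chunksize, overlap=0):
    # Same hypothetical formula, but failing early with a readable message.
    if chunksize is None:
        raise ValueError("chunk size is not set; pass --wsize")
    return (length - overlap) // (chunksize - overlap) + 1


def build_parser():
    # Assumed parser setup: marking --wsize as required makes argparse
    # stop with a usage error instead of passing None downstream.
    parser = argparse.ArgumentParser()
    parser.add_argument("--wsize", type=int, required=True,
                        help="sequence size of each training example")
    return parser
```

With `required=True`, calling the script without `--wsize` would exit with a usage message before any chunking code runs.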
Let me know if you encounter any other issues.
Best,
Lars
Thanks for the insight. Can you recommend a good parameter choice for diatoms? I'm testing the model on a bunch of algae genomes that need gene calls. The alternative is MetaEuk, but with my database it uses a considerable amount of resources.