Gaius-Augustus/Tiberius

[Bug] `TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'` when running `write_tfrecord_species.py`

jolespin opened this issue · 3 comments

I'd like to train a model on some genomes I have available locally. My gene models are in GFF format, so I converted them to GTF with gffread, extracted the longest isoform, and then tried to create tfrecords, but the script failed.

I'm attaching the input files in case they're useful:

$ python Tiberius/bin/write_tfrecord_species.py --fasta ${SPECIES}.fa --gtf ${SPECIES}.longest.gtf --out tfrecords/${SPECIES}
2024-10-07 21:43:22.216480: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 371, in <module>
    main()
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 309, in main
    fasta, ref = get_species_data_hmm(genome_path=args.fasta, annot_path=args.gtf, 
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 119, in get_species_data_hmm
    f_chunk = fasta.get_flat_chunks(strand='+', pad=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/genome_fasta.py", line 143, in get_flat_chunks
    // (self.chunksize - self.overlap) + 1
        ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

pt_mag.tar.gz

Hi,

thanks for using Tiberius!

The script requires the --wsize argument, which specifies the sequence length of each training example. For example, we used --wsize 9999 for training on the mammalian genomes.
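
To illustrate why the missing argument shows up as a TypeError rather than a clear message, here is a minimal sketch. It assumes the window size stays None when --wsize is omitted and is later used in the `(chunksize - overlap)` arithmetic seen in the traceback. The names `count_chunks` and `parse_args` are hypothetical and are not Tiberius functions; only the `--wsize` flag and the shape of the expression come from this thread.

```python
import argparse


def count_chunks(seq_len, chunksize, overlap=0):
    """Hypothetical helper that mirrors the arithmetic in the traceback."""
    if chunksize is None:
        # Without this check, `None - overlap` raises
        # "unsupported operand type(s) for -: 'NoneType' and 'int'".
        raise ValueError("chunk size is not set; pass --wsize")
    return seq_len // (chunksize - overlap) + 1


def parse_args(argv):
    """Hypothetical parser that rejects a missing --wsize up front."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--wsize", type=int, required=True,
                        help="sequence length of each training example")
    return parser.parse_args(argv)


# Illustrative values only: 9999 is the window size mentioned above,
# and 20000 is an arbitrary sequence length.
args = parse_args(["--wsize", "9999"])
n = count_chunks(seq_len=20000, chunksize=args.wsize)
```

With a check like this, leaving out --wsize would stop at argument parsing with a usage message instead of failing later inside the chunking step.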

I seem to have forgotten to include it in the documentation; I'm sorry.

Let me know if you encounter any other issues.

Best,
Lars

Thanks for the insight. Is there a parameter choice you'd recommend for diatoms? I'm testing the model on a bunch of algae genomes that need gene calls. The alternative is MetaEuk, but with my database it uses a considerable amount of resources.