[Bug] `TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'` when running `write_tfrecord_species.py`

Question

jolespin opened this issue 3 months ago · 3 comments

I'd like to train a model on some genomes I have available locally. My gene models are in GFF format, so I converted them to GTF with gffread, extracted the longest isoform, and then tried creating tfrecords, but the script failed.

I'm attaching the input files in case they're useful. The command and its output:

$ python Tiberius/bin/write_tfrecord_species.py --fasta ${SPECIES}.fa --gtf ${SPECIES}.longest.gtf --out tfrecords/${SPECIES}
2024-10-07 21:43:22.216480: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 371, in <module>
    main()
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 309, in main
    fasta, ref = get_species_data_hmm(genome_path=args.fasta, annot_path=args.gtf, 
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/write_tfrecord_species.py", line 119, in get_species_data_hmm
    f_chunk = fasta.get_flat_chunks(strand='+', pad=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/SageMaker/efs/sandbox/sandbox/development/jolespin/Testing/EukGeneModeling/tiberius_testing/Tiberius/bin/genome_fasta.py", line 143, in get_flat_chunks
    // (self.chunksize - self.overlap) + 1
        ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

pt_mag.tar.gz

Answer 1 · 2024-10-08T08:58:49.000Z

Hi,

thanks for using Tiberius!

The script requires the --wsize argument, which specifies the sequence size of each training example. For example, we used --wsize 9999 for training on the mammalian genomes.
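To illustrate why the missing argument surfaces as a TypeError rather than a usage error, here is a minimal Python sketch. It is not the Tiberius code: `build_parser` and `num_chunks` are hypothetical names, the arithmetic only loosely mirrors the fragment visible in the traceback, and the link between --wsize and the chunk size is an assumption. The idea is that an optional integer argument defaults to None when omitted, and the subtraction then fails.

```python
import argparse


def build_parser():
    # Hypothetical parser: --wsize is optional, so it is None when omitted.
    parser = argparse.ArgumentParser()
    parser.add_argument("--wsize", type=int, default=None)
    return parser


def num_chunks(seq_len, chunksize, overlap=0):
    # Hypothetical helper, loosely based on the expression in the traceback.
    # Failing early gives a clearer message than the raw TypeError.
    if chunksize is None:
        raise ValueError("chunk size is unset; pass --wsize (e.g. --wsize 9999)")
    return seq_len // (chunksize - overlap) + 1


# Without --wsize, the unguarded arithmetic reproduces the reported error.
args = build_parser().parse_args([])
try:
    100_000 // (args.wsize - 0) + 1
    message = ""
except TypeError as err:
    message = str(err)

# With --wsize given, the same arithmetic works.
args_ok = build_parser().parse_args(["--wsize", "9999"])
chunks = num_chunks(100_000, args_ok.wsize)
```

In this sketch, marking the argument as required in the parser (`required=True`) would be another way to turn the failure into a usage message.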

I seem to have forgotten to include it in the documentation. I'm sorry.

Let me know if you encounter any other issues.

Best,
Lars

Answer 2 · 2024-10-09T16:17:52.000Z

Thanks for the insight. Can you recommend a good parameter choice for diatoms? I'm testing the model on a number of algae genomes that need gene calls. The alternative is MetaEuk, but with my database it uses a considerable amount of resources.

Answer 3 · 2024-10-09T19:02:17.000Z

Please send me an email (or give me a phone call). I am willing to share results on this and possibly collaborate, but the current results are not yet good enough for me to publish a parameter set.
