Python library: Using GetCtgSeq() returns sequences with numerous leading repetitions of "\x00"
patrick-koenig opened this issue · 10 comments
Hello,
many thanks for this very useful tool and the Python bindings!
While integrating an AGC archive into a web service, I discovered the following behaviour of the GetCtgSeq() method:
seq = agc_arch.GetCtgSeq('Morex', 'chr1H', 400000000, 400000099)
print(seq)
Output:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00GAACAGATCCACACCGCAAACGGAGCAGGTATGAGGATTCAGCGTGTTGGTCATTCAGTTATAATAACCCCTCATCAAAAATTCCATCTTAGAAATATTC'
As one can see the sequence itself is prefixed with numerous repetitions of "\x00".
One can easily remove them by using seq.lstrip('\x00')
but it would be better if this is not necessary with a future release.
Thanks
Is it possible to provide me the dataset for which this happens?
We plan to release AGC v.3.1 today or just after the weekend. If possible we would fix this bug.
@sebastiandeorowicz
Yes, with pleasure!
Do you need the final archive file or the FASTA files? Just as a note: The FASTA files look absolutely normal.
The used FASTA files can be downloaded here: https://galaxy-web.ipk-gatersleben.de/libraries/folders/F9ce617df390851fc/page/1
Command line was those from the examples in the README. Sorry, I did not record the command. But it was nothing special. More or less default parameters for agc create
.
The problem appears only when using the Python bindings. If I use command line executable, everything is fine with the sequence extraction. I guess the bug is located in the source code for the Python binding.
Probably you are right. I'm downloading and let you know after fixing the bug. I hope we will be able to include the fix in v.3.1.
One more thing. Which sample is the reference?
The reference was the genotype "Morex" with this filename: Morex_v3_pseudomolecules_and_unplaced_contigs_ENA.fasta.gz
What I forgot to mention: The preceding repititions of \x00
are independet from the parameters of the GetCtgSeq()
method. It doesn't matter if I set start to the first nucleotide (position = 0) or to some position near the end of the contig. It also doesn't matter which genotype or contig I choose.
OS is Rocky Linux 9 with Python 3.11 via miniconda. To make the bindings work with Python 3.11 I switched pybind11 to version 2.11.1. But the \x00
issue also appeared with Python 3.10 and pybind11-2.8.1
as it is in your repository here.
You were right. The bug is related to the connection between the C++ and Python libraries. We will fix it in the forthcoming release.
AGC 3.1 with the fix is released.