Error: invalid base: 0067
Opened this issue · 6 comments
I'm trying to use simpleaf to build an index for Glycine max (soybean). The genome and gtf files required some preprocessing to get them properly formatted.
I ran the following command (using simpleaf 0.16.2
):
simpleaf index --output simpleaf_index --fasta ../../g_max.genome.fasta --gtf ../../g_max.longest_transcripts.gtf --rlen 91 --threads 16 --use-piscem
Which resulted in the following output:
2024-06-04T10:05:48.261414Z INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2024-06-04T10:05:50.342651Z INFO grangers::reader::gtf: Finished parsing the input file. Found 0 comments and 752330 records.
2024-06-04T10:05:51.029383Z INFO roers: Built the Grangers object for 752330 records
2024-06-04T10:05:51.237147Z WARN grangers::grangers_info: The exon_number column contains null values. Will compute the exon number from exon start position .
2024-06-04T10:05:51.527120Z WARN roers: Found missing gene_id and/or gene_name; Imputing. If both missing, will impute using transcript_id; Otherwise, will impute using the existing one.
2024-06-04T10:05:51.549542Z INFO roers: Proceed 278761 exon records from 55589 transcripts
Error: invalid base: 0067
The error message is a bit cryptic, so I don't really know what to do. I tried searching some of the rust repositories but haven't found the error message source yet.
If relevant I can provide the genome and gtf files.
EDIT:
Upon further investigation this seems to stem from the noodles crate: https://github.com/zaeleus/noodles/blob/906f5237c68fc6b04a73010580d3c4fed2c7b66e/noodles-fasta/src/record/sequence/complement.rs#L24. However, I don't really understand what's wrong yet.
Quick python check:
>>> bytes([67])
b'C'
Which should be possible to reverse complement?
The error message is printing the invalid character in hex, though there's a slight issue with the formatting. 0x67
is g
. Complementing lowercase bases was recently added to noodles 0.75.0/noodles-fasta 0.39.0 in zaeleus/noodles#267.
Thanks @zaeleus! @holmrenser — perhaps a quick solution to the problem would be to convert the lowercase reference sequence to uppercase? Something like:
seqkit seq -u input_ref.fa > output_ref.fa
with the excellent seqkit
tool?
@DongzeHE: We should handle this internally. Specifically we should decide when we normalize the sequence.
Ah I definitely did not catch the hex formatting! Since I'm doing the preprocessing myself I can easily convert to uppercase. Would be nice to have some docs and a bit more explicit warnings/errors. With some help I can probably submit a PR.
Hi @holmrenser,
I definitely agree. It would be great to handle such cases internally (and then potentially report statistics). The only question here is "when" normalization should take place. Right now the process looks like:
(1) Construct the augmented references (this is where the error you saw occurred)
(2) Compute the relevant signatures of the resulting sequences (for reproducibility & metadata propagation)
(3) Build the index on these sequences
Currently, any normalization is happening after step 1. We could put normalization in step (1) itself, or we could just upgrade the noodles version. The question is if that takes care of all cases we'd wish to handle or not (I think it does).
I can see there's a few considerations that have to be weighed. For now, I can confirm that converting the genome sequences to all uppercase solved the issue. Thanks!