update pointfinder database to include all species
Closed this issue · 11 comments
Dear @apetkau,
I tried to implement some checks on the PointFinder database and the staramr analysis:
- Check that the FASTA files and the overview list are present
- Check the FASTA content against the mutations listed for each species
- Verify each mutation position in the amino acid and nucleotide sequences
- Convert negative coordinates to positive values
- Create a new file with the "real position"
- Handle deletions and insertions
- Handle bad information in the overview file
I have some trouble with insertions/deletions:
- How does staramr deal with them? What happens after the BLAST step if an indel is present?
- The second point is just that I need more time to deal with problems in the database build:
- There is some bad information in the overview file (e.g. the sequence includes the promoter, but positions start from the CDS...)
Hello @pimarin
Thanks so much for sending this message and for the work you have done related to staramr 👍. I am very grateful to hear about your efforts to sort out how staramr works and to add support for this feature.
However, please note that I currently have someone working with me on writing the code to add support for additional PointFinder organisms to staramr already. So some of the work you have done may already be in progress or completed.
Could you maybe describe a bit more about how you are approaching the problem? Are you trying to convert files in the PointFinder database, such as coordinates for promoters (e.g., here https://bitbucket.org/genomicepidemiology/pointfinder_db/src/bfa17543d776faf3962ba1e824dec5f55a66d73b/escherichia_coli/resistens-overview.txt#lines-66:72), to values that better match how staramr already handles coordinates?
The current approach we are taking is to modify the staramr code to handle promoter regions rather than modify the PointFinder database files. This makes updating the PointFinder database much easier, since no files need to be changed.
To answer your questions:
How does staramr deal with them [insertions/deletions]? What happens after the BLAST step if an indel is present?
Blast should record insertions/deletions. But this is something you would have to sort out yourself with test data as I don't have the exact details.
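As a rough illustration of what "recording" an indel looks like: BLAST reports the aligned reference and query strings with gaps written as `-`. The sketch below is not staramr's code. The function name `find_indels` and its output format are hypothetical, and it only assumes two aligned strings of equal length.

```python
def find_indels(aligned_ref, aligned_query):
    """List gap runs in a pairwise alignment as (kind, ref_position, bases).

    ref_position is 1-based: for an insertion it is the last reference base
    before the inserted bases, for a deletion it is the first deleted base.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned strings must have the same length")
    events = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(aligned_ref)
    while i < n:
        if aligned_ref[i] == "-":  # gap in the reference: insertion in the query
            j = i
            while j < n and aligned_ref[j] == "-":
                j += 1
            events.append(("insertion", ref_pos, aligned_query[i:j]))
            i = j
        elif aligned_query[i] == "-":  # gap in the query: deletion
            j = i
            while j < n and aligned_query[j] == "-":
                j += 1
            events.append(("deletion", ref_pos + 1, aligned_ref[i:j]))
            ref_pos += j - i
            i = j
        else:  # match or mismatch column
            ref_pos += 1
            i += 1
    return events
```

Grouping consecutive gap columns into one event keeps a multi-base indel from being reported as several single-base ones.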
The second point is just that I need more time to deal with problems in the database build:
Which issues are you encountering with database building?
There is some bad information in the overview file (e.g. the sequence includes the promoter, but positions start from the CDS...)
I understand what you mean by bad information. But I wouldn't call it bad so much as it's just in a different coordinate system. Mainly, coordinates are given with 1 starting from the coding region rather than 1 starting from the beginning of the sequence in the FASTA file. The direction I am taking with adding support for additional PointFinder organisms is adding code to staramr to handle converting between these different coordinate systems as staramr is running, rather than pre-processing the database files to modify coordinates.
I hope this helps you out. I am happy to answer any questions you may have and help you understand how staramr works. But, again, the work you are doing is already being handled by my group at this current moment.
Hi @apetkau,
I will make a PR to illustrate my suggestion. Maybe it can give you some ideas for modifying staramr.
My approach was to check the PointFinder database, then create a copy of the overview.txt file with both the original and the updated positions, while keeping the rest of the database unchanged. Finally, I just add a test to check whether this file has been created; if not, the code runs to create it. This makes it possible to work with any installed version of PointFinder, and it keeps working when the database is updated.
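A minimal sketch of that "create it only if it is missing" check, with hypothetical names throughout: `resistens-overview.txt` is the overview file name seen in the PointFinder database, but `resistens-overview.converted.txt`, `ensure_converted_overview` and the `convert` callback are made up for illustration. Rebuilding when the original is newer than the copy is an added assumption, meant to cover database updates.

```python
from pathlib import Path


def ensure_converted_overview(organism_dir, convert):
    """Return the path of the converted overview, creating it when needed.

    The original overview file is never modified. `convert` is any function
    that maps the original file text to the converted text.
    """
    organism_dir = Path(organism_dir)
    original = organism_dir / "resistens-overview.txt"
    converted = organism_dir / "resistens-overview.converted.txt"  # hypothetical name

    # Build the copy if it does not exist yet, or if the database file
    # has been updated since the copy was written.
    stale = (
        not converted.exists()
        or converted.stat().st_mtime < original.stat().st_mtime
    )
    if stale:
        converted.write_text(convert(original.read_text()))
    return converted
```

Because the converted file sits next to the untouched original, the same code path works for a fresh install and for an updated database.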
Do you have an idea of when version 1.0 will be released?
staramr is a core tool in the French project on AMR studies (a kind of IRIDA-like platform).
Even if I failed to help you, it was a pleasure to learn Python to work on your code!
Depending on the policy of your dev community, maybe we can work together?
Hi @emarinier, I saw your commits and I would like to know if you have a date for the next release of staramr with your modifications? I saw that you have also integrated indels and promoter positions.
Best wishes
Hi @emarinier, I tested your dev branch to understand what you have already done. When I tested a mutation in a promoter, the tool detected a mutation but gave a codon position in the log output. I'm not sure using codons for a promoter region is a good idea?
Hi @emarinier, I saw your commits and I would like to know if you have a date for the next release of staramr with your modifications? I saw that you have also integrated indels and promoter positions. Best wishes
Hey, I'm not sure when it'll be released, because there's a lot of testing that needs to be done. The Pointfinder databases have a lot of special cases that need to be checked thoroughly, so it's taking me some time. I think it's getting close though.
Hi @emarinier, I tested your dev branch to understand what you have already done. When I tested a mutation in a promoter, the tool detected a mutation but gave a codon position in the log output. I'm not sure using codons for a promoter region is a good idea?
Which mutation did you use? The testing I've done shows the results with nucleotide coordinates:
ampC_promoter_size_53bp
Isolate ID	Gene	Predicted Phenotype	Type	Position	Mutation	%Identity	%Overlap	HSP Length/Total Length	Contig	Start	End
ampC_promoter_size_53bp-Cn42T	ampC_promoter_size_53bp (C-42T)	unknown[ampC_promoter_size_53bp (C-42T)]	nucleotide	-42	C -> T	99.31	100.00	145/145	ampC_promoter_size_53bp	1	145
Mutation of a C to a T at position -42 in the promoter.
EDIT: Unless you were talking about a mutation in the "codon" part of the promoter? The promoters in the Pointfinder database have both a nucleotide part and codon part (which presumably is the coding sequence that immediately follows the promoter). That's why there's negative coordinates used when dealing with promoters.
Hi @emarinier,
I just tested it with my input FASTA (for this comment, I highlighted the mutation position with *):
>ampC_promoter_size_53bp
TGGCTGCTATCCTGACAGTTGTCACGCTGATTGGTGTCGTTACAATCTAACGCATCGCCAATGTAAATCCGGCCCGCCTATGGCGGGCCGTTTTGTATGGAAA*C*CAGACCCTATGTTCAAAACGACGCTCTGCACCTTATTAATT
>ampC_promoter_MUT
TGGCTGCTATCCTGACAGTTGTCACGCTGATTGGTGTCGTTACAATCTAACGCATCGCCAATGTAAATCCGGCCCGCCTATGGCGGGCCGTTTTGTATGGAAA*T*CAGACCCTATGTTCAAAACGACGCTCTGCACCTTATTAATT
At position -42 I put a T instead of a C, but nothing happens in the PointFinder output; there is only one message in the log:
CodonMutationPosition(_database_amr_gene_start=1, _nucleotide_position_amr_gene=51, _codon_start=17, _database_amr_gene_codon=AAC, _input_genome_codon=AAT)
If I instead treat -42 as an amino acid (codon) position, which corresponds to nucleotide -126, no mutation is detected. I do reproduce your result when I mutate the 12th nucleotide of the sequence, and I get the same result on the CGE server, but I don't understand where the real starting point is.
Hi @emarinier, I just tested it with my input FASTA (for this comment, I highlighted the mutation position with *):
>ampC_promoter_size_53bp
TGGCTGCTATCCTGACAGTTGTCACGCTGATTGGTGTCGTTACAATCTAACGCATCGCCAATGTAAATCCGGCCCGCCTATGGCGGGCCGTTTTGTATGGAAA*C*CAGACCCTATGTTCAAAACGACGCTCTGCACCTTATTAATT
>ampC_promoter_MUT
TGGCTGCTATCCTGACAGTTGTCACGCTGATTGGTGTCGTTACAATCTAACGCATCGCCAATGTAAATCCGGCCCGCCTATGGCGGGCCGTTTTGTATGGAAA*T*CAGACCCTATGTTCAAAACGACGCTCTGCACCTTATTAATT
At position -42 I put a T instead of a C, but nothing happens in the PointFinder output; there is only one message in the log:
CodonMutationPosition(_database_amr_gene_start=1, _nucleotide_position_amr_gene=51, _codon_start=17, _database_amr_gene_codon=AAC, _input_genome_codon=AAT)
It looks like you put the mutation at (sequence) position 42, not Pointfinder position negative 42.
In promoters, Pointfinder has a negative 0-based nucleotide coordinate space and a positive 1-based codon coordinate space.
In the case of ampC, the negative nucleotide portion is for the first 53 bases, which is derived from the FASTA record name (ampC_promoter_size_53bp). i.e. position "0" starts at sequence position 53. I don't know why they've chosen to do it this way, but these sorts of design decisions are why it's taking time to push this release.
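To make the two coordinate spaces concrete, here is a small sketch based only on the numbers in this thread. It is not the staramr implementation: the helper names `promoter_size` and `to_pointfinder`, and the regular expression used on the record name, are assumptions for illustration.

```python
import re


def promoter_size(record_name):
    """Read the promoter length from a name like 'ampC_promoter_size_53bp'.

    Returns 0 when the name carries no promoter size (assumed naming pattern).
    """
    match = re.search(r"_promoter_size_(\d+)bp$", record_name)
    return int(match.group(1)) if match else 0


def to_pointfinder(index, size):
    """Map a 0-based index in the FASTA sequence to (kind, position).

    Indexes inside the promoter get a negative nucleotide position;
    indexes after it get a 1-based codon number.
    """
    if index < size:
        return ("nucleotide", index - size)
    return ("codon", (index - size) // 3 + 1)


size = promoter_size("ampC_promoter_size_53bp")
print(to_pointfinder(11, size))   # 12th base of the sequence
print(to_pointfinder(103, size))  # 104th base, the one marked with * above
```

With a 53 bp promoter, the 12th base maps to nucleotide -42 (the C-42T mutation), while the base marked with * maps to codon 17, which agrees with `_codon_start=17` in the log message.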
With respect to output, you shouldn't see much of anything if no matching mutation is found in the Pointfinder DB. If you change something arbitrary, it shouldn't show up in the log output. What you're seeing is probably a print statement that I left in for debugging, since it's a work-in-progress development branch.
Hopefully that helps!
Hmm, I forgot that the FASTA sequence contains both a promoter part and a gene part... you're right! So everything works fine now! But I don't know who decides how this information is submitted, because I imagine you will check the FASTA name to retrieve the correct 0 coordinate? OK, I'll wait for the release then!