pha4ge/hAMRonization

Groot parser implementation

cimendes opened this issue · 5 comments

With the following single-entry output this is currently what is being parsed:

OqxA.3003470.EU370913.4407527-4408202.4553 266 657 3D648M6D

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'OqxA'
    assert result.gene_name == 'OqxA.3003470.EU370913' 
    assert result.reference_database_id == 'argannot'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == 'OqxA.3003470.EU370913.4407527-4408202.4553' 
    assert result.analysis_software_name == 'groot'
    assert result.analysis_software_version == '0.0.1'
    assert result.reference_gene_length == 657
    assert result.coverage_depth == 266

As you can see, the gene_symbol, gene_name and reference_accession fields are all storing the same information.

I've a bit of trouble mapping the fields in OqxA.3003470.EU370913.4407527-4408202.4553 to the spec. The gene_name technically this is not present in the report. Should we store the same value as gene_symbol or keep is as None?
For the reference_accession, shouldn't we keep just the EU370913 value? I'm unsure of what 3003470 represents, as well as the 4407527-4408202.4553.

Any input is welcomed!

I figure because its dependent on what data you create the db with we can't parse too tightly for groot.

gene_name, gene_symbol, reference_accession are mandatory, but as we can't guarantee input formatting we should probably just sling the same thing into gene_symbol, gene_name and reference_accession

3003470 is the ARO accession and I think the other numbers are related to indexed locations in the variation graph and clusters.

Maybe @will-rowe can be of assistance here? :)

Heya. This looks like a great and much needed project! Groot hasn't received much love recently. What do you need? Sounds like @fmaguire is right though - as users can change the input DB, it is going to be hard to write a generic parser? Happy to make updates to groot if needed

Hey @will-rowe! Thanks for joining the discussion! Could you please clarify what is in the groot's report? Maybe adding some headers to the tsv file would be a nice addition.. :P I've quite a bit of trouble mapping it to our AMR spec. (Warning: this is very much WIP!). Both gene symbol and gene name are mandatory fields, and having duplicated information there feel a little bit to me like "cheating". :P I would like to avoid that if possible.

Sorry @cimendes - dropped the ball here.

The report is 4 column tsv where you have:

  • ARG name
  • mapped read count
  • ARG reference length
  • CIGAR to describe reference coverage

I thought this was in the docs but I can't find it - sorry! Will add it. The ARG name is just lifted from whatever input was used for indexing. So in your linked example, that is just the header from the CARD-3.0.4 multifasta.

This does need improving and groot seems to still be going strong so I need to work on this. Open to suggestions on how though. One way to do it could be to have a flag provided to the report subcommand which you can use to sanitise the report based on a database (CARD/resfinder). So it could lookup the multifasta header against your AMR spec for CARD/resfinder. Is there a way to do this already? This also means that if a user didn't use CARD/resfinder, it would fall back to the old behaviour of just using the multifasta header. I'd update the report format to have consistent fields regardless though (possibly just duplicating gene symbol and gene name if sanitisation wasn't possible/requested