Construction resource files
Dear,
Could some more information be given on how the three CSV files for opt$marker (e.g. mutations_list_grouped_pango_codonPhased_2023-02-17_Europe.csv), opt$smarker (mutations_special_2022-12-21.csv) and opt$pmarker (mutations_problematic_vss1_v3.csv) are made, so that I can make custom files?
Thank you
Laura
Hello
The general principle is that we take the multiple sequence alignment from GISAID, extract the mutations (phased over each codon), link each mutation with the Pango assignment of the respective sequence, and report for each "relevant" lineage all mutations that are found in more than 80% of the sequences in that lineage.
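To make the 80% rule concrete, here is a minimal Python sketch of the last step only. It is not the code used to build the files in the repo. It assumes the mutations have already been extracted and codon-phased, and that each sequence arrives as a (lineage, set of mutations) pair. The function name `marker_mutations`, the parameter names, and the toy lineage and mutation labels are all made up for illustration.

```python
from collections import Counter, defaultdict


def marker_mutations(records, relevant_lineages, min_fraction=0.8):
    """Return, per relevant lineage, the mutations seen in more than
    `min_fraction` of that lineage's sequences.

    records: iterable of (lineage, mutations) pairs, one pair per sequence.
    relevant_lineages: set of lineage names to report (chosen by hand here).
    """
    n_seqs = Counter()                 # sequences seen per lineage
    mut_counts = defaultdict(Counter)  # lineage -> mutation -> sequence count

    for lineage, mutations in records:
        if lineage not in relevant_lineages:
            continue
        n_seqs[lineage] += 1
        # set() ensures a mutation is counted at most once per sequence
        mut_counts[lineage].update(set(mutations))

    return {
        lineage: sorted(
            mut for mut, count in mut_counts[lineage].items()
            if count / total > min_fraction  # strictly more than 80%
        )
        for lineage, total in n_seqs.items()
    }


# Toy input with placeholder names (not real lineages or mutations).
records = [
    ("X", {"mutA", "mutB"}),
    ("X", {"mutA", "mutB"}),
    ("X", {"mutA", "mutB", "mutC"}),
    ("X", {"mutA", "mutB"}),
    ("X", {"mutA"}),
    ("Y", {"mutA", "mutD"}),
    ("Y", {"mutA", "mutD"}),
    ("Z", {"mutE"}),  # ignored: "Z" is not in the relevant set
]

markers = marker_mutations(records, relevant_lineages={"X", "Y"})
print(markers)
```

In this toy input, mutB occurs in exactly 4 of 5 "X" sequences (80%), so it is not reported, because the threshold is "more than 80%". The choice of `relevant_lineages` is the manual part and is simply passed in here.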
Unfortunately we cannot share the code for this process as of now, since the construction of those files is still a bit more hands-on than we would like it to be. This is mainly because the definition of a "relevant" lineage must be manually adapted to avoid conflicts. Once we have a fully automated way to generate those marker mutation definitions, we will include it in the repo.
The files we provide in the repo should be generally applicable to most use cases. We make an update each month, which is more or less synced with the intervals at which GISAID provides new alignments. A new update should still arrive this week.
If you have a special need for such a file, drop me a mail and I am sure we can sort this out.
bw, Fabian
Thank you for the explanation. I was wondering whether these files are limited to specific time intervals. I'm running into some errors when analysing samples from 2020, 2021 and 2022 in one run. Also, when using the latest files, it seems that only omicron variants were returned?
Hi again
2020 is certainly hard, since the definition includes only two pre-alpha lineages.
If you really get omicron in pre-2022 samples, I would suggest raising the --minuniqmark option to a value of 2 or even 3.
This is not the case; it's more that, with the current databases, I'm unable to analyse samples from 2020-2021 that contain B.1.1.7.
Hi, @fabou-uobaf. I'm curious whether you've made any more progress on automating your mutation extraction process. If not, could you help me to produce a mutations_list dataset?
My lab is in the process of comparing deconvolution tools on wastewater samples with controlled mixtures of SARS-CoV-2 lineages. We're curious what difference it might make to use VaQuERo with a US-based (or even state-specific) mutation dataset rather than the datasets you already have available. Have you noticed much difference in your results when using your mutations_list_grouped_pango_codonPhased_*_Austria.csv datasets vs the Europe datasets? Is there a reason that your last three versions are labeled Austria but the earlier ones are labeled Europe?
Also, is sampling date factored into lineage considerations? I know it's required in the metadata, but it's not exactly relevant in a controlled study like ours. So far, we're seeing fairly large variations between our known lineage abundances and VaQuERo's abundance estimates. Perhaps that could be improved with better utilization of the command line options you offer. I wonder if you could help me see what improvements I can make to my current inputs.
If you want to reach out to me directly to discuss more, I'm available at skunklem@uncc.edu. I look forward to hearing from you.
Hello,
I was wondering whether there is an update on the construction of the resource files so that we can use other datasets than the ones provided?
Kind regards
Laura