Model, query mutations and threshold questions
Closed this issue · 3 comments
Dear Isaac,
Thank you for sharing your your solution to predict Pango lineage composition in a pooled wastewater sample.
Here are a couple of questions and suggestions.
From the code we see that you fit a linear model between selected mutations abundance in the inputs (Y) and mutation profiles defining selected Pango Lineages (X).
- Why only 12 lineages were selected for querying (
B.1.526.1
,B.1.427
,B.1.617.3
,B.1.617.2
,B.1.526
,B.1.429
,B.1.617
,B.1.1.7
,P.1
,B.1.617.1
,B.1.351
,B.1.526.2
,P.2
)? Could you provide logic on the mutations and lineages selection process? Why these were selected and not all ~1200+ Pango Lineages? - It seems that only lineages with
S:N501Y
mutation were selected, was there a reason for this? Why this filter is necessary? - How Pango lineage mutation profiles were defined? I see that you've used
Outbreak.Info
API queries by lineage such as https://api.outbreak.info/genomics/lineage-mutations?pangolin_lineage=B.1.1.7 to pull data. What that the case? - Does the future iteration of the model would consider multiple testing corrections and intrinsic hierarchical structure existing between lineages given rise to increased error rates and inflated beta coefficients?
- The beta coefficients of Panglo lineage terms are presented as abundances, but are just weights of the model terms. Do you plan also to assign p-values to terms and do some randomization (e.g. permutations) to improve reliability and stability of the prediction?
- Do you also plan to update
setup.py
to automatically setup dependenciesnumpy
,scikit-learn
,matplotlib
seaborn
andpysam
during install? - What if a given query position has more than one mutation? Are you only taking into account abundance of the lineage-associated SNV while disregarding other potential mutations?
- When selecting mutations "strongly" associated with Pango lineages
mut_lins[aa_m][l] > 0.5
, why global mutation prevalence of 0.5 was selected as a cutoff and not more higher values such as 0.9 in case you would like to look for "lineage semi-exclusive" mutations? Would you consider adding this as a command-line parameter? - Have you considered training a tree ensemble classifier (XGBoost, RF) or conditional inference forest (CIF) on the filtered GISAID input?
Good luck with this project and we would be glad to exchange ideas and test other approaches on lineages resolution from the mixed wastewater samples.
Kirill
Hi Kirill,
Thanks for all the feedback. We'll be uploading a preprint soon that addresses some of these but for now:
1/2/3. The lineage information all comes from outbreak.info. We can certainly add more lineages as long as there is reliable data for which mutations they contain.
4. There aren't really multiple testing conditions here. In all cases we are simply looking for VOC abundances that can account for the mutation rates we observe. The hierarchical structure could cause the model to be less accurate on distinguishing related lineages but I think that problem is inherent.
5. The reasoning behind ordinary least squares will be explained in the manuscript. P-values are a good idea. I'll have to do some thinking about how to implement that.
6. I was torn about this since it's a pretty heavy list of requirements and not all functionality requires it, but I think you're right that they should all be in the requirements.
7. The tool will look for all SNVs including multiple SNVs at the same genomic index. If there is more than one SNV that can account for the same amino acid change, it takes the highest rate from among them. Currently it ignores amino acid swaps that require multiple nucleotide changes (such as N:D3L) but I would like to add those soon.
8. This is a good point. The problem is that some mutations are strongly (>0.9) associated with one VOC but weakly (>0.5) associated with another and I didn't want to have to drop those entirely. In practice, I don't think it will matter very much but this is worth thinking about.
9. My original plan was to use an ensemble classifier or SVM classifier but there is a good theoretical basis for OLS here which will be explained in the manuscript.
Hi Isaac,
Pleasure talking to you. As a follow up on our conversation, here are the API links on the key mutations defined by each PANGO Lineage at Outbreak.Info.
E.g. https://api.outbreak.info/genomics/lineage-mutations?pangolin_lineage=B.1.1.7
If you want to know global lineage prevalence
https://api.outbreak.info/genomics/prevalence-by-location?pangolin_lineage=B.1.1.7&cumulative=true
If you want lineage prevalence in Canada only
https://api.outbreak.info/genomics/prevalence-by-location?pangolin_lineage=B.1.1.7&location_id=CAN&fetch_all=true
I had created a small program for Cody to get number of lineages that are shared given a mutation. Here is the link https://github.com/kbessonov1984/OutbreakInfoInterrogator/. What is worth noting is that I get all fresh lineages from Pango Repo at https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt and then query Outbreak.Info API with those lineages. Another way to compile all available Pango Lineages is perhaps download GISAID metadata file as Outbreak.Info intrinsically depends on that data, but do not know how frequently do they update and sync.
Hey, likewise! That's really helpful thanks. I'll work on adding support for more (or all) lineages soon. By the way, the preprint is up on medrxiv and I added a link in the README.