lenaschimmel/sc2rf

Deltacrons with NSP3 breakpoint

ktmeaton opened this issue · 17 comments

Thank you for developing this incredible tool! I'm testing it out by reproducing some of the recombinant clades described in pango-designation. So far it's going really well, but I'm struggling with detecting/interpreting Deltacrons with an NSP3 breakpoint. Some context:


So far, BA.1/BA.2 recombination is straightforward to detect and interpret (regardless of breakpoint) (ex. cov-lineages/pango-designation#448)

python3 search_recombinants.py pango-designation_issue_448.fasta 

image


With some parameter tweaking, I can also recover Deltacrons when the breakpoint involves the S gene (ex. cov-lineages/pango-designation#444). So many intermissions though!

python3 search_recombinants.py pango-designation_issue_444.fasta \
  --parents 2-4 \
  --breakpoints 1-4 \
  --max-intermission-count 50 \
  --max-intermission-length 3

image


But I can't reliably detect Deltacrons that do not involve the S gene, such as at NSP3. (ex. cov-lineages/pango-designation#446).

python3 search_recombinants.py pango-designation_issue_446.fasta \
  --parents 2-4 \
  --breakpoints 0-10 \
  --unique 1

Part 1:
image

Part 2:
image

RIPPLES also struggles with these. It identifies the 21I & 21K samples as recombinants (Part 1), and not the 21I & 21J & 21K (Part 2).

My interpretation is that these are "Potential Deltacrons", but there is insufficient nucleotide information to resolve it one way or another. Would you agree with this classification/interpretation?

I will take a closer look on this later… Until then, two quick observations:

First picture

The many intermissions might be caused by the definition of the parent lineages, which might contain too many mutations which only occur rarely. This is discussed in more detail in #4 and is partly fixed in the current version. I can't see which version you are using (also, there are no real "versions" of this tool for now, just irregular updates to the main branch), but I can see that it's not the most current one. Later, there will be a full fix, that lets you select a threshold for the mutation prevalence, so that more or less mutations will be taken as required.

Third picture

There are some very faint grey letters on white ground - poor color choice by me and general issues with the colors on different systems, as noted in #2. But the actual point is: your sequences begin with TT..T.T..G...... which clearly isn't Omicron, and resembles 21I much more than 21J. While G4181T hints at 21J, I'd think that a single mutation is very little evidence for a three-parent-recombination-event. I would suspect that without --parents 2-4 and --unique 1 these would be classified as 21I / BA.1 recombinants and a much clearer picture would be the result.

Oh, and thanks for testing it out!

So far, I could only test very few of the sequences from the pango-designation proposals, because I still don't have access to GISAID and can only test those issues that contain actual fasta-downloads. I love to see how it performs - or doesn't - on input that I could not test.

First picture

The many intermissions might be caused by the definition of the parent lineages, which might contain too many mutations which only occur rarely. This is discussed in more detail in #4 and is partly fixed in the current version. I can't see which version you are using (also, there are no real "versions" of this tool for now, just irregular updates to the main branch), but I can see that it's not the most current one. Later, there will be a full fix, that lets you select a threshold for the mutation prevalence, so that more or less mutations will be taken as required.

This feature is now available. By default, mutations which occur in at least 75% of sequences on cov-spectrum are considered as required (was 70% in earlier versions). But this can be adjusted by using --mutation-threshold NUM where num is between 0.05 and 1.00.

For my BA.1 / BA.2 sequences from Germany, 75% was the best value (everything between 65% and 80% worked fine), but for your sequences the optimum might be something else.

PS: I guess meant "second picture" whenever I wrote "first picture".

It's tempting to think of 4181T as "21J only" and 5584G as "21I only". However, with <30,000 bases to mutate, untold millions of cases, and many millions of sequenced Delta genomes, there are chances for these clade-defining mutations to join each other by recurrent mutations or recombination and show up in some genome sequences. This cov-spectrum query finds a cluster in AY.44 (21J) that has both 4181T and 5584G, as well as mutations 3505T and 4015G shared by the sequences in cov-lineages/pango-designation#446:

https://cov-spectrum.org/explore/World/AllSamples/Past6M/variants?variantQuery=3505T+%26+4015G+%26+5584G+%26+4181T

Genomes with those 4 mutations were found mostly in Nevada, Utah, Arizona and other western states, well into Dec. 2021. So I think the puzzling combo of 4181 and 5584 happened in the AY.44 parent of this recombinant. And I believe the recombination with Omicron happened at some position < 7124.

Before it occurred to me to look for the Delta parent in cov-spectrum, I was really struggling with how to explain 21I, 21J and Omicron all at once. For the record, here's how far I got with that:


For cov-lineages/pango-designation#446 I think it would be best for someone with more sequencing experience to look at the raw read data for at least some of the sequences, especially the ones with 4181T.

I think that SARS-CoV-2 genomes unfortunately have a significant level of contamination & assembly errors, and I think the prior likelihood is low for the same genome to have Omicron and 21I and 21J mutations -- that means either a) a triple-infection, multiple recombination events and forward transmission, or b) a recombination and transmission event and a spontaneous mutation that just happens to be a 21J mutation. 4181 also seems to blink "off and on" within the UT-UPHL sequences in the UCSC/UShER tree, not a good sign that it's reliably detected there.

On the other hand, five different labs (CDCBI, UPHL, CDC-STM, CDC-FG and CDC-QDX) reported seeing 210T, 4181T, 5584G, followed by 8393A and a bunch of other Omicron mutations (minus Spike amplicon dropouts here and there, no surprise for Omicron), and 3505T and 4015G. The probability of that happening by the same combination of error modes in five labs is hopefully vanishingly small. It's almost six labs: UNM reported a sequence, USA/NM-UNM-ED_02539/2022|EPI_ISL_11010751|2022-02-21, that is the same except for an N at 4181. So I believe in all the mutations except for a lingering doubt about 4181.

Regarding reference-allele-for-21J sites near the beginning -- most could be explained as artefacts of #4 that Lena has fixed with --mutation-threshold -- but not 7124T, that should be shared by pretty much all 21Js. (But 21I's 5584G preceeds that anyway; by that point it could be either 21I or Omicron, there's a big region of uncertainty between 5584 and 8393.)

BTW this:
image
looks to me like Spike amplicon dropout, unfortunately very common in Omicron, combined with insufficient filtering out of stray reads from contamination (or just back-filling reference genome sequence?), apparently pretty common in genome assembly pipelines used for SARS-CoV-2 (and the bane of SARS-CoV-2 phylogenetics).

I might be too tired today to really grasp all of what you said, but let me add a few noted and observations:

21I and 21J

You mentioned 21I and 21J and I just noticed that in the current version, both are included in mappings.csv and as such the tool pretends to know them, but in virus_properties.json they are just empty. This is my fault, as I started with a version from nextstrain/nextclade_data (probably this one) and later switched to cov-spectrum. It was easy to map each clade to a pango lineage, except 21I and 21J which both correspond to several AY's each. I'm not sure what exactly to do about it, but it surely can't stay this way. I'll open an Issue for that shortly. See #10.

Amplicon dropouts

Distinguishing actual recombination from amplicon dropouts / co-infections / contamination seems to be the hardest and most important part to me. I recently built some features to help with that, by showing which amplicons cover which region, and where the respective primers are. Obviously, this depends on knowlege about the used primer sets. Do you have that information about those USA/UT-UPHL sequences?

To give you an idea what this feature does, here are several versions of ARTIC and EasySeq primers shown with some suspicious German sequences (just for illustration purposes, obviously they have nothing to do with the Delta/Omicron Recombinants this issue is about). In this case, I did not know which primers where used and just made some educated guesses. And I definitely need to write some documentation on how to use and interpret that, probably tomorrow. (See #11)

primers

Do you think this is a useful feature that could help answer those questions?

21I and 21J

That totally makes sense, yes, it is too bad that there is no direct Pango analog for 21I and 21J.

co-infections / contamination

For distinguishing those from recombinants I think it's best to have access to raw reads (or really, trimmed aligned reads distilled to allele frequencies) so that an an approach like the one in this twitter thread can be applied with plots showing mixture frequencies for co-infection / contam:


image


vs. all-or-nothing frequencies for transmitted recombinant:


image


In the absence of within-sample allele frequencies, however, having a lot of genomes with the same mutations, and one or two clear breakpoints, can give us pretty good confidence that a recombinant is real IMO.

Amplicon dropouts

That is a really cool new feature!

Obviously, this depends on knowlege about the used primer sets. Do you have that information about those USA/UT-UPHL sequences?

I don't have that info about UT-UPHL sequences -- @erinyoung? (specifcally USA/UT-UPHL-220202309634/2022, USA/UT-UPHL-220204125669/2022, UT-UPHL-220204444755/2022, UT-UPHL-220204978864/2022)

By the way, I just added the mindnight primers to the repository. From what I could find quickly, there's only one version of them, unlike ARTIC and EasySeq?

I think Midnight had to change only one primer for Omicron, pretty impressive: https://twitter.com/freed_nikki/status/1489015462570586114

This page has a lot of great info about primers and how they're affected by various lineages: https://primer-monitor.neb.com/lineages

@lenaschimmel Let me look those sequences up

The consensus fastas for the following sequences were generated were generated from single-end Illumina reads with the dragen pipeline.

submission_id,sra_id
UT-UPHL-220202309634,SRR17916009
UT-UPHL-220204125669,SRR17977717
UT-UPHL-220204444755,SRR17977317
UT-UPHL-220204978864,SRR17977192

I'm off to find their amplicon files.

These isolates were all amplified with the V4 primers (not 4.1) with 75 bp fragments

# UT-UPHL-220202309634

# UT-UPHL-220204125669
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2,+
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2 Kmers,57956
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2 Targets probed,99
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2 Unique kmers detected,420606319
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2 Targets,98
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2 Fraction kmers detected,0.951964
VIRUS DETECTION METRICS,UT-UPHL-220204125669,SARS-CoV2 Fraction of Targets detected,0.989899

# UT-UPHL-220204444755
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2,+
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2 Kmers,57956
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2 Targets probed,99
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2 Unique kmers detected,612614149
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2 Targets,99
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2 Fraction kmers detected,0.963024
VIRUS DETECTION METRICS,UT-UPHL-220204444755,SARS-CoV2 Fraction of Targets detected,1

# UT-UPHL-220204978864
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2,+
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2 Kmers,57956
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2 Targets probed,99
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2 Unique kmers detected,372882465
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2 Targets,99
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2 Fraction kmers detected,0.956967
VIRUS DETECTION METRICS,UT-UPHL-220204978864,SARS-CoV2 Fraction of Targets detected,1

Thanks a lot, @erinyoung! I'm not competent to interpret those VIRUS DETECTION METRICS, but ARTIC v4 and a spike from BA.2 for me seems to be a good indicator of amplicon 76 dropout, which is the only one that covers this region:

image

@ktmeaton, or really anyone with access to the Pango #446 sequences, you could run

python3 search_recombinants.py pango-designation_issue_446.fasta \
  --parents 2-4 \
  --breakpoints 0-10 \
  --unique 1
  --primers primers/artic_v4.bed
  --primer-intervals -8400 22882

to get a clearer picture of the primers / amplicons affected, maybe even with --show-private-mutations. I'm not sure if this will deliver any new insight, or only be a fun little exercise to show off my new feature and illustrate what has already been said by me and the others, though :)


Also, I will have a look at #10 (21I / 21J / AY.*) soon. I'm very confident that I can provide a solution within the next few hours.

There's an update on #10 which is also relevant to this issue. See my comment here.

I just wanted to say thank you for this super detailed discussion! :) I'm still thinking and testing it out, but this has helped a lot!